Problem description
The following is the code (using iText for .NET, version 7.0.4.0) that I use to extract text from a PDF. In my testing it works well for most PDFs and returns only the content inside the rectangle. For a few of them, however, it returns the entire line from the PDF. I know
But I want to understand which parameter in the PDF iText uses to split the text.
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Filter;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

var reader = new PdfReader( filePath );
PdfDocument pdfDoc = new PdfDocument( reader );
var addressRect = new Rectangle( 33, 190, 70, 42 ); // x, y, width, height
var addressRegionFilter = new TextRegionEventFilter( addressRect );
var filterListener = new FilteredTextEventListener( new LocationTextExtractionStrategy(), addressRegionFilter );
var addressText = PdfTextExtractor.GetTextFromPage( pdfDoc.GetPage( 1 ), filterListener );
pdfDoc.Close();
Recommended answer
This should do the trick.
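import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.CharacterRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.ITextExtractionStrategy;
import java.util.Set;

class RectangleTextExtractionStrategy implements ITextExtractionStrategy
{
    // Strategy that receives only the characters inside the search region
    private ITextExtractionStrategy innerStrategy = null;
    // Search region, in the same coordinates as the page content
    private Rectangle rectangle;

    public RectangleTextExtractionStrategy(ITextExtractionStrategy strategy, Rectangle rectangle)
    {
        this.innerStrategy = strategy;
        this.rectangle = rectangle;
    }

    @Override
    public String getResultantText() {
        return innerStrategy.getResultantText();
    }

    @Override
    public void eventOccurred(IEventData iEventData, EventType eventType) {
        if (eventType != EventType.RENDER_TEXT)
            return;
        TextRenderInfo tri = (TextRenderInfo) iEventData;
        // Split the chunk into one render info per character and test each one
        for (TextRenderInfo subTri : tri.getCharacterRenderInfos())
        {
            Rectangle r2 = new CharacterRenderInfo(subTri).getBoundingBox();
            if (intersects(r2))
                innerStrategy.eventOccurred(subTri, EventType.RENDER_TEXT);
        }
    }

    // Left as a TODO in the original answer; a plain axis-aligned
    // overlap test is one possible way to fill it in.
    private boolean intersects(Rectangle candidate)
    {
        return candidate.getLeft() < rectangle.getRight()
            && rectangle.getLeft() < candidate.getRight()
            && candidate.getBottom() < rectangle.getTop()
            && rectangle.getBottom() < candidate.getTop();
    }

    @Override
    public Set<EventType> getSupportedEvents() {
        return innerStrategy.getSupportedEvents();
    }
}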
The idea here is to split every incoming TextRenderInfo object into separate events, one per character. Each character event that falls inside the search region is then passed on to another ITextExtractionStrategy.
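To make the difference concrete, here is a small self-contained sketch that does not use iText at all. `Box` and `Chunk` are hypothetical stand-ins invented for this illustration, and the fixed-width glyphs are a simplifying assumption (real PDF glyph widths come from the font). It only models the behaviour described above: when a whole line arrives as one chunk, a chunk-level test keeps or drops the line as a whole, while a per-character test keeps just the characters that overlap the region.

```java
import java.util.Arrays;
import java.util.List;

class ChunkSplitSketch {

    /** Hypothetical axis-aligned box (not an iText type): lower-left corner plus size. */
    static final class Box {
        final float x, y, width, height;

        Box(float x, float y, float width, float height) {
            this.x = x; this.y = y; this.width = width; this.height = height;
        }

        /** True when the two boxes share a non-empty area. */
        boolean intersects(Box other) {
            return x < other.x + other.width && other.x < x + width
                && y < other.y + other.height && other.y < y + height;
        }
    }

    /** Hypothetical text chunk: a string drawn left to right with fixed-width glyphs. */
    static final class Chunk {
        final String text;
        final float startX, baselineY, glyphWidth, glyphHeight;

        Chunk(String text, float startX, float baselineY, float glyphWidth, float glyphHeight) {
            this.text = text; this.startX = startX; this.baselineY = baselineY;
            this.glyphWidth = glyphWidth; this.glyphHeight = glyphHeight;
        }

        /** Box covering the whole chunk. */
        Box boundingBox() {
            return new Box(startX, baselineY, glyphWidth * text.length(), glyphHeight);
        }

        /** Box covering the single character at the given index. */
        Box characterBox(int index) {
            return new Box(startX + index * glyphWidth, baselineY, glyphWidth, glyphHeight);
        }
    }

    /** Chunk-level filtering: each chunk is kept or dropped as a whole. */
    static String filterByChunk(List<Chunk> chunks, Box region) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (c.boundingBox().intersects(region)) {
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    /** Character-level filtering: every character is tested on its own. */
    static String filterByCharacter(List<Chunk> chunks, Box region) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            for (int i = 0; i < c.text.length(); i++) {
                if (c.characterBox(i).intersects(region)) {
                    sb.append(c.text.charAt(i));
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One chunk holding a whole line; it starts at x=30 and each glyph is 10 units wide.
        List<Chunk> chunks = Arrays.asList(new Chunk("NAME: JOHN DOE", 30f, 200f, 10f, 12f));
        // Same numbers as the rectangle in the question: x=33, y=190, width=70, height=42.
        Box region = new Box(33f, 190f, 70f, 42f);

        System.out.println(filterByChunk(chunks, region));     // prints NAME: JOHN DOE
        System.out.println(filterByCharacter(chunks, region)); // prints NAME: JO
    }
}
```

In this sketch a character counts as inside the region as soon as its box overlaps it at all. Requiring full containment instead would be a stricter rule and would drop characters that straddle the edge.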