Question
My code below loses text when reading a PDF file that has only one column on the first page and more than one column on the other pages.
Can someone tell me what I'm doing wrong? My code is below:
PdfReader pdfreader = new PdfReader(pathNmArq);
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
for (int page = 1; page <= lastPage; page++)
{
    extractText = PdfTextExtractor.GetTextFromPage(pdfreader, page, strategy);
    extractText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(extractText)));
    // ...
}
Recommended answer
You use the SimpleTextExtractionStrategy. This strategy assumes that the text drawing instructions in the PDF are sorted in reading order. That does not seem to be true for your file.
If you cannot count on the PDF containing its drawing operations in reading order, but want to use only the text extraction strategies that ship with iText, you have to know the areas that each constitute a single column. If a page contains multiple columns, you have to use a RegionTextRenderFilter to restrict extraction to one column and then use the LocationTextExtractionStrategy on that region.
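The per-column approach described above could be sketched roughly as follows. This is only an illustration against iTextSharp 5.x, not a drop-in fix: the class ColumnTextExtractor and its two methods are hypothetical names, and the sketch assumes exactly two columns split at the horizontal middle of the page, which the question does not confirm. Measure the real column boundaries of your document and adjust the rectangles accordingly.

```csharp
using System;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class, not part of iTextSharp.
public static class ColumnTextExtractor
{
    // Extracts the text whose chunks intersect the given region of a page.
    public static string ExtractRegion(PdfReader reader, int page, Rectangle region)
    {
        RenderFilter[] filters = { new RegionTextRenderFilter(region) };

        // A new strategy is created on every call so that text collected
        // for an earlier region or page is not carried over.
        ITextExtractionStrategy strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), filters);

        return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    }

    // ASSUMPTION: two columns, divided at the horizontal middle of the page.
    public static string ExtractTwoColumnPage(PdfReader reader, int page)
    {
        Rectangle size = reader.GetPageSize(page);
        float middle = (size.Left + size.Right) / 2;

        // Rectangle(llx, lly, urx, ury): lower-left and upper-right corners.
        Rectangle left = new Rectangle(size.Left, size.Bottom, middle, size.Top);
        Rectangle right = new Rectangle(middle, size.Bottom, size.Right, size.Top);

        // Read the left column first, then the right one.
        return ExtractRegion(reader, page, left)
            + Environment.NewLine
            + ExtractRegion(reader, page, right);
    }
}
```

For the single-column first page, calling ExtractRegion with the full page size (reader.GetPageSize(1)) should be enough; the remaining pages would go through ExtractTwoColumnPage. Text that straddles the dividing line may show up in both columns, so a small gap between the two rectangles can help if the layout leaves a gutter.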
PS: What exactly is the intention of this line?
extractText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(extractText)));