Problem description
Using iTextSharp, I am trying to extract the text from the following pdf file:
https://www.treasury.gov/ofac/downloads/sdnlist.pdf
This is the code:
var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, 2, new SimpleTextExtractionStrategy());
if (currentText.Length > 0)
{
    var capture = new Capture();
    capture.Text = currentText;
    // write the results to the DB, if any data was found
    _dataService.AddCapture(capture);
}
Using the SimpleTextExtractionStrategy, the results are written to the database with many unwanted spaces inside words. The first several lines of page 2 come out as:
See for example the word "JO INT" in the 4th & 6th lines, and the word "CON CERN" in the 2nd to last line. Spaces like these occur throughout the results. Unfortunately, this makes querying the text impossible.
Does anyone have any idea why this does this and how to resolve this?
why this does this
The cause actually is a feature of the text extraction strategy which in your case does not work as desired.
A bit of background: what you perceive as a space between words in a PDF file is not necessarily produced by an instruction that draws a space character; it can also result from an instruction that shifts the text insertion position a little to the right. Text extraction strategies therefore usually add a space character when they find a sufficiently large right shift of that kind. For more on this (in particular the "sufficiently large" part), see e.g. this answer.
In your document, though, the width information of the body text font is too small (if used as is, the characters would appear glued together with no space in between). There are therefore small right shifts between each pair of consecutive characters, and some of these shifts are wide enough to be falsely identified as word separations by the mechanism explained above.
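To make the mechanism concrete, here is a minimal sketch that simulates it without iTextSharp. The `Chunk` type, the `ImpliedSpaceDemo` class, and all coordinates are made up for illustration; only the "gap wider than half a space width" test mirrors the check in SimpleTextExtractionStrategy. Each letter of JOINT is drawn as its own chunk with an understated width, followed by a small right shift, and one of those shifts happens to exceed the threshold.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for one text-drawing instruction: the text it draws
// and the x-coordinates where its baseline starts and ends.
public sealed class Chunk
{
    public string Text { get; }
    public float StartX { get; }
    public float EndX { get; }

    public Chunk(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class ImpliedSpaceDemo
{
    // Joins chunks in drawing order and inserts a space whenever the gap to the
    // previous chunk is wider than half the width of a space character.
    public static string Join(IEnumerable<Chunk> chunks, float singleSpaceWidth)
    {
        var result = new StringBuilder();
        float? lastEndX = null;
        foreach (var chunk in chunks)
        {
            bool neitherSideIsSpace = result.Length > 0
                && result[result.Length - 1] != ' '
                && chunk.Text.Length > 0
                && chunk.Text[0] != ' ';
            if (lastEndX.HasValue && neitherSideIsSpace)
            {
                float spacing = Math.Abs(chunk.StartX - lastEndX.Value);
                if (spacing > singleSpaceWidth / 2f)
                {
                    result.Append(' ');
                }
            }
            result.Append(chunk.Text);
            lastEndX = chunk.EndX;
        }
        return result.ToString();
    }

    // "JOINT" with a declared glyph width of 4 units and right shifts of
    // 1, 2, 1 and 1 units between the letters (invented numbers).
    public static List<Chunk> JointChunks()
    {
        return new List<Chunk>
        {
            new Chunk("J", 0f, 4f),
            new Chunk("O", 5f, 9f),
            new Chunk("I", 11f, 15f),
            new Chunk("N", 16f, 20f),
            new Chunk("T", 21f, 25f),
        };
    }
}

public static class Program
{
    public static void Main()
    {
        // With a space width of 3 units the threshold is 1.5, so only the
        // 2-unit shift after "O" is taken for a word break.
        Console.WriteLine(ImpliedSpaceDemo.Join(ImpliedSpaceDemo.JointChunks(), 3f)); // prints "JO INT"
    }
}
```

With a larger assumed space width (for example 5 units, giving a threshold of 2.5), none of the shifts qualifies and the word comes out intact, which is why the outcome depends so heavily on the font's metrics.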
how to resolve this
As word separations in your PDF are created by instructions drawing a space character, you do not need the feature explained above. Thus, the easiest way to resolve the issue is to use a text extraction strategy without that feature.
You can create such a strategy by copying the source code of the SimpleTextExtractionStrategy (e.g. from here) and commenting out some lines in the method RenderText as below:
public virtual void RenderText(TextRenderInfo renderInfo)
{
    [...]
    if (hardReturn)
    {
        //System.out.Println("<< Hard Return >>");
        AppendTextChunk('\n');
    }
    else if (!firstRender)
    {
        // if (result[result.Length - 1] != ' ' && renderInfo.GetText().Length > 0 && renderInfo.GetText()[0] != ' ')
        // { // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
        //     float spacing = lastEnd.Subtract(start).Length;
        //     if (spacing > renderInfo.GetSingleSpaceWidth() / 2f)
        //     {
        //         AppendTextChunk(' ');
        //         //System.out.Println("Inserting implied space before '" + renderInfo.GetText() + "'");
        //     }
        // }
    }
    else
    {
        //System.out.Println("Displaying first string of content '" + text + "' :: x1 = " + x1);
    }
    [...]
}
Using this simplified extraction strategy, your text is properly extracted.
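For readers who prefer a complete class over editing a copy, the following is a rough sketch of what such a strategy might look like. It assumes iTextSharp 5.x; the class name `NoImpliedSpaceTextExtractionStrategy` and the `ExtractionExample` helper are hypothetical, and the line-break detection is reproduced from memory of the library source, so compare it against the version you actually use.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical name. Modeled on SimpleTextExtractionStrategy, but without the
// check that turns a wide gap between chunks into a space character.
public class NoImpliedSpaceTextExtractionStrategy : ITextExtractionStrategy
{
    private Vector lastStart;
    private Vector lastEnd;
    private readonly StringBuilder result = new StringBuilder();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public string GetResultantText()
    {
        return result.ToString();
    }

    public void RenderText(TextRenderInfo renderInfo)
    {
        bool firstRender = result.Length == 0;
        LineSegment segment = renderInfo.GetBaseline();
        Vector start = segment.GetStartPoint();
        Vector end = segment.GetEndPoint();

        if (!firstRender)
        {
            // Squared distance of the new start point from the previous
            // baseline; more than 1 is treated as the start of a new line.
            Vector lastBaseline = lastEnd.Subtract(lastStart);
            float dist = lastBaseline.Cross(lastStart.Subtract(start)).LengthSquared
                         / lastBaseline.LengthSquared;
            if (dist > 1f)
            {
                result.Append('\n');
            }
        }

        // Only space characters that are actually drawn end up in the output.
        result.Append(renderInfo.GetText());
        lastStart = start;
        lastEnd = end;
    }
}

public static class ExtractionExample
{
    // Extracts one page with the strategy above; pdfPath is supplied by the caller.
    public static string ExtractPage(string pdfPath, int pageNumber)
    {
        var reader = new PdfReader(pdfPath);
        try
        {
            return PdfTextExtractor.GetTextFromPage(
                reader, pageNumber, new NoImpliedSpaceTextExtractionStrategy());
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Note that this only suits documents like the one in question, where words are separated by drawn space characters; for PDFs that rely on right shifts for word gaps, it would run words together.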