Problem description
I use iText to convert a PDF to a text file. It actually works well, but for some words it does the following: the PDF contains a phrase like "present the main ideas", but iText creates output like "presentthemainideas". Is there any way to correct this behaviour?
String pdf = "/home/can/Downloads/NLP/textSummarization/A New Approach for Multi-Document Update Summarization.pdf";
String txt = "/home/can/myWorkSpace/PDFConverterProject/outputs/bb.txt";
StringBuffer text = new StringBuffer();
String resultText = "";
PdfReader reader;
try {
    reader = new PdfReader(pdf);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));
    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        text.append(strategy.getResultantText());
    }
    resultText = text.toString();
    resultText = resultText.replaceAll("-\n", "");
    out.println("-->" + resultText);
    StringTokenizer stringTokenizer = new StringTokenizer(resultText, "\n");
    PrintWriter lineWriter = new PrintWriter(new FileOutputStream("/home/can/myWorkSpace/PDFConverterProject/outputs/line.txt"));
    while (stringTokenizer.hasMoreTokens()) {
        String curToken = stringTokenizer.nextToken();
        lineWriter.println("line-->" + curToken);
    }
    lineWriter.flush();
    lineWriter.close();
    out.flush();
    out.close();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
}
Recommended answer
The reason for such missing space characters is that the space you see in the rendered PDF does not necessarily correspond to a space character in the page content description of the PDF. Instead, you often find an operation in PDFs that, after rendering one word, moves the current position slightly to the right before rendering the next word.
Unfortunately, the same mechanism is also used to enhance the appearance of adjacent glyphs: in some letter combinations, for a good appearance and reading experience, the glyphs should be printed nearer to each other or farther from each other than they would be by default. This is done in PDFs using the same operation as above.
Thus, a PDF parser in such situations has to use heuristics to decide whether such a shift was meant to imply a space character or whether it was merely meant to make the letter group look good. And heuristics can fail.
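To illustrate why the parser cannot tell the two cases apart, here is a minimal sketch in plain Java (no iText involved). It models the array form of the PDF text-showing operator (TJ), where strings are drawn and numbers, given in thousandths of a text space unit, are subtracted from the current position. The class name TjShiftDemo, the method gaps, and the word-gap value -250 are made up for illustration; the kerning numbers follow the familiar "AWAY" kerning example.

```java
import java.util.ArrayList;
import java.util.List;

class TjShiftDemo {
    /**
     * Walks a TJ-style array: strings are drawn, numbers shift the position.
     * Numbers are in thousandths of a text space unit and are subtracted,
     * so a negative number moves the next glyph further to the right.
     * Returns the extra shift applied before each string after the first.
     */
    static List<Double> gaps(Object[] tjArray) {
        List<Double> gaps = new ArrayList<>();
        double pendingShift = 0;
        boolean seenString = false;
        for (Object entry : tjArray) {
            if (entry instanceof Number) {
                // Accumulate the adjustment; the sign flips because it is subtracted.
                pendingShift += -((Number) entry).doubleValue() / 1000.0;
            } else {
                if (seenString) {
                    gaps.add(pendingShift);
                }
                pendingShift = 0;
                seenString = true;
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Kerning: positive numbers pull the glyphs of "AWAY" closer together.
        Object[] kerned = {"A", 120, "W", 120, "A", 95, "Y"};
        // Word separation without any space character (assumed value -250).
        Object[] words = {"present", -250, "the", -250, "main", -250, "ideas"};
        System.out.println(gaps(kerned));
        System.out.println(gaps(words));
    }
}
```

Both arrays contain only strings and numbers. Nothing in the content marks the second kind of shift as "this is a space", so the extractor can only look at the size of the shift.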
You use SimpleTextExtractionStrategy as the text extraction strategy. The heuristics in this case are implemented like this (as currently in the renderText method in SimpleTextExtractionStrategy.java in the iText SVN trunk):
float spacing = lastEnd.subtract(start).length();
if (spacing > renderInfo.getSingleSpaceWidth()/2f)
{
    result.append(' ');
}
Thus, a gap that is more than half as wide as the current width of a space character is translated into a space character.
This generally sounds sensible. In documents that use only horizontal shifts to separate words, though, the current width of the actual space character may not be a good measure for the heuristics.
So, what you can do is try to improve the heuristics in the text extraction strategy. Copy the existing one, manipulate it, and use it in your code.
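As a rough sketch of what "manipulating" the heuristic could mean, the following plain-Java model makes the threshold factor configurable. It is not iText code: Chunk is a hypothetical stand-in for the values a render callback provides, the coordinates are invented, and only the gap test from the quoted snippet is mirrored, not the rest of the strategy.

```java
import java.util.Arrays;
import java.util.List;

class SpacingHeuristicDemo {
    /** Hypothetical stand-in for one rendered text chunk and the current space width. */
    static final class Chunk {
        final String text;
        final float startX, endX, spaceWidth;

        Chunk(String text, float startX, float endX, float spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.spaceWidth = spaceWidth;
        }
    }

    /**
     * Joins chunks, inserting a space when the gap to the previous chunk
     * exceeds spaceWidth * factor. A factor of 0.5 mirrors the quoted snippet.
     */
    static String extract(List<Chunk> chunks, float factor) {
        StringBuilder result = new StringBuilder();
        Chunk last = null;
        for (Chunk c : chunks) {
            if (last != null) {
                float spacing = Math.abs(c.startX - last.endX);
                if (spacing > c.spaceWidth * factor) {
                    result.append(' ');
                }
            }
            result.append(c.text);
            last = c;
        }
        return result.toString();
    }

    /** Invented numbers: words sit about 1.2 units apart, the space glyph is 3.0 units wide. */
    static List<Chunk> sample() {
        return Arrays.asList(
                new Chunk("present", 0f, 40f, 3f),
                new Chunk("the", 41.2f, 55f, 3f),
                new Chunk("main", 56.2f, 80f, 3f),
                new Chunk("ideas", 81.2f, 105f, 3f));
    }

    public static void main(String[] args) {
        System.out.println(extract(sample(), 0.5f)); // prints "presentthemainideas"
        System.out.println(extract(sample(), 0.3f)); // prints "present the main ideas"
    }
}
```

In a copy of the iText strategy, the corresponding change would be in the comparison inside renderText. A lower factor recovers more word gaps but also raises the risk that loosely kerned letters inside one word get split, so the value probably needs tuning against the actual documents.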
If you supply a sample PDF for your issue, we might have some ideas to help.