Performance: iText vs. PdfBox



I'm trying to convert a PDF (my favorite book, Effective Java, if it matters) to text, and I checked both iText and Apache PdfBox. I see a really big difference in performance: with iText it took 2:521, and with PdfBox 6:117. This is my code for PdfBox:

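PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document); // assumption: document is the already loaded PDDocument

And for iText:

PdfReader reader = new PdfReader(pdf);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
  BUFFER.append(PdfTextExtractor.getTextFromPage(reader, i));
}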


My question is: what does the performance depend on, and is there a way to make PdfBox faster? Or should I just use iText? And can you explain more about how the strategies affect performance?



One major difference is that PDFBox always processes text glyph by glyph, while iText normally processes it chunk by chunk (a chunk being the single string parameter of a text drawing operation); that reduces the resources iText requires quite a lot. Furthermore, the event-oriented architecture of iText's text parsing puts a lower burden on resources than that of PDFBox. And PDFBox keeps information not strictly required for plain text extraction available for a longer time, which costs more resources.
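To make the glyph-versus-chunk point concrete, here is a toy model in plain Java. It uses neither library; `GranularitySketch` and `TextPiece` are hypothetical names, and one `char` is treated as one glyph purely for illustration. It only shows how the number of objects the extractor has to create and track grows when each string operand is split into single glyphs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: neither library's real classes are used here. */
final class GranularitySketch {

    /** Hypothetical record of one positioned piece of text. */
    static final class TextPiece {
        final String text;
        final int startOffset;

        TextPiece(String text, int startOffset) {
            this.text = text;
            this.startOffset = startOffset;
        }
    }

    /** Chunk-wise: one object per string operand of a text drawing operation. */
    static List<TextPiece> perChunk(List<String> operands) {
        List<TextPiece> pieces = new ArrayList<>();
        int offset = 0;
        for (String s : operands) {
            pieces.add(new TextPiece(s, offset));
            offset += s.length();
        }
        return pieces;
    }

    /** Glyph-wise: one object per character (assumption: one char is one glyph). */
    static List<TextPiece> perGlyph(List<String> operands) {
        List<TextPiece> pieces = new ArrayList<>();
        int offset = 0;
        for (String s : operands) {
            for (int i = 0; i < s.length(); i++) {
                pieces.add(new TextPiece(String.valueOf(s.charAt(i)), offset));
                offset++;
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        List<String> operands = Arrays.asList("Effective ", "Java");
        System.out.println("chunk-wise objects: " + perChunk(operands).size());
        System.out.println("glyph-wise objects: " + perGlyph(operands).size());
    }
}
```

On a whole book the same ratio applies to every text drawing operation, which is one plausible reason for the gap in the timings above.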


But the way the libraries initially load the document may also make a difference. Here you can experiment a bit: PDFBox offers not only multiple PDDocument.load overloads but also some PDDocument.loadNonSeq overloads (actually, PDDocument.loadNonSeq reads documents correctly, while PDDocument.load can be tricked into misinterpreting PDFs). All these variants may have different runtime behavior.
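If you want to compare the loading variants yourself, a small timing helper along the following lines may do. This is only a sketch: `LoadTimingSketch` and `averageMillis` are hypothetical names, and the workload in `main` is a stand-in that you would replace with a call to one of the PDDocument.load or PDDocument.loadNonSeq overloads (followed by closing the document). The warm-up runs are there because the first executions on a JVM include class loading and JIT compilation, which would otherwise distort a single measurement.

```java
import java.util.concurrent.Callable;

/** Minimal timing helper; a sketch, not a substitute for a real benchmark framework. */
final class LoadTimingSketch {

    /**
     * Runs the task warmupRuns times untimed, then measuredRuns times timed,
     * and returns the average duration of the timed runs in milliseconds.
     */
    static <T> double averageMillis(Callable<T> task, int warmupRuns, int measuredRuns)
            throws Exception {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.call(); // untimed warm-up
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.call();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / measuredRuns;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload; replace with the document loading call you want to measure.
        Callable<Integer> standIn = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
            return sb.length();
        };
        System.out.println("average ms: " + averageMillis(standIn, 3, 5));
    }
}
```

Measuring loading and text extraction separately this way should show which of the two phases actually accounts for the difference.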


iText brings along a simple and a more advanced text extraction strategy. The simple one assumes that text in the page content stream appears in reading order, while the more advanced one sorts the text. By default the more advanced one is used. Thus, you can probably speed up iText even more by using the simple strategy. PDFBox always sorts.
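The difference between the two strategies can be sketched with a toy model in plain Java. This is not iText's implementation; `ExtractionStrategySketch`, `Chunk`, `extractSimple` and `extractSorted` are hypothetical names, and the sort key (top to bottom, then left to right) is a simplifying assumption. The point is that the simple variant appends chunks as they arrive, while the sorting variant has to buffer every chunk of the page and order them before producing text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model of the two strategies; this is not iText's implementation. */
final class ExtractionStrategySketch {

    /** Hypothetical text chunk: a string plus the position where it is drawn. */
    static final class Chunk {
        final String text;
        final float x;
        final float y; // PDF coordinates: a larger y is higher on the page

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Simple strategy: trust the content stream order and just concatenate. */
    static String extractSimple(List<Chunk> streamOrder) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : streamOrder) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Sorting strategy: buffer all chunks, order top to bottom, then left to right. */
    static String extractSorted(List<Chunk> streamOrder) {
        List<Chunk> sorted = new ArrayList<>(streamOrder); // extra copy kept per page
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        return extractSimple(sorted);
    }

    public static void main(String[] args) {
        // Chunks deliberately drawn out of reading order.
        List<Chunk> page = Arrays.asList(
                new Chunk("world", 60f, 700f),
                new Chunk("second line", 10f, 680f),
                new Chunk("Hello", 10f, 700f));
        System.out.println(extractSimple(page));
        System.out.println(extractSorted(page));
    }
}
```

With iText 5.x, the loop from the question could presumably be switched to the simple strategy by passing a `SimpleTextExtractionStrategy` instance as a third argument to `PdfTextExtractor.getTextFromPage`; whether the output is still in reading order then depends on how the PDF was produced.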

