Is there a C++ library to extract text from a PDF file, like PDFBox for Java?

Problem description

Last year, I made an application in Java that uses PDFBox to get the raw text from some PDF files, and I now need to port that application to C++. I would like to know the best C++ alternative for what I need.

Here is an example in case it helps. Most files will look like this: http://www.jumbala.net/backup/league.pdf

With PDFBox, each line read on page 2 and most of page 3 of that file is output with all of its data separated by spaces, instead of being kept in a grid as it is in the PDF. So the first relevant line on page 2 would look like this:

FB 847 - Tremblay, Gérard 179,63 56 16167 90 268 s27 p3 669 s14 199 223 193 615

or something like that. There are minor changes in the order in which the values appear, but I don't care about that as long as similar lines are output the same way, since I just parse them and put the values I need in different variables.

So, knowing all of that, is there a library that I can use in a C++ program to get similar results?

Edit: After looking at sacredFaith's link at http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file and trying it, I get weird output for the example file I mentioned earlier: http://www.jumbala.net/backup/league.pdf.txt

The parts I actually need are in the weird characters at the beginning. Using Adobe Acrobat Reader X with Save As... Text (accessible), I get the following result: http://www.jumbala.net/backup/league_good.pdf.txt

That is approximately what I get in Java using PDFBox, and it is what I want to get as output in C++.

Solution

Xpdf is a C++ application/library which includes tools to extract plain text from a PDF file.
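One low-effort way to try Xpdf from a C++ program is to run its `pdftotext` command-line tool and read what it prints, rather than linking against Xpdf's internal classes. The sketch below assumes that `pdftotext` is installed and on the `PATH`, and that a POSIX `popen` is available (on Windows the equivalent is `_popen`/`_pclose`). The helper names `shell_quote`, `build_command`, `normalize_lines` and `extract_lines` are made up for this illustration. Whether `-layout` reproduces the PDFBox output for the league file has not been verified; it is only a plausible starting point.

```cpp
#include <cassert>
#include <cstdio>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Wrap a path in single quotes for a POSIX shell, escaping embedded quotes.
std::string shell_quote(const std::string& s) {
    std::string out = "'";
    for (char c : s) {
        if (c == '\'') out += "'\\''";
        else out += c;
    }
    out += "'";
    return out;
}

// Build the pdftotext command line: -layout keeps each table row on one
// physical line, -f/-l select the page range, and "-" writes to stdout.
std::string build_command(const std::string& pdf_path, int first_page, int last_page) {
    assert(first_page >= 1 && first_page <= last_page);
    std::ostringstream cmd;
    cmd << "pdftotext -layout -enc UTF-8 -f " << first_page
        << " -l " << last_page << ' ' << shell_quote(pdf_path) << " -";
    return cmd.str();
}

// Split the text into lines, collapse runs of whitespace (including the
// form feeds pdftotext puts between pages) into single spaces, and drop
// lines that end up empty.
std::vector<std::string> normalize_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string raw;
    while (std::getline(in, raw)) {
        std::istringstream words(raw);
        std::string word, line;
        while (words >> word) {
            if (!line.empty()) line += ' ';
            line += word;
        }
        if (!line.empty()) lines.push_back(line);
    }
    return lines;
}

// Run pdftotext and return the normalized lines of the requested pages.
std::vector<std::string> extract_lines(const std::string& pdf_path, int first_page, int last_page) {
    const std::string command = build_command(pdf_path, first_page, last_page);
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) throw std::runtime_error("could not start pdftotext");

    std::string output;
    char buffer[4096];
    std::size_t n;
    while ((n = std::fread(buffer, 1, sizeof buffer, pipe)) > 0) {
        output.append(buffer, n);
    }
    if (pclose(pipe) != 0) throw std::runtime_error("pdftotext reported an error");
    return normalize_lines(output);
}
```

Calling `extract_lines("league.pdf", 2, 3)` would return one string per non-empty line of pages 2 and 3, with the values separated by single spaces and ready to be parsed into variables. If the row order or grouping comes out wrong, `pdftotext` also offers a `-raw` mode (content-stream order) that may be worth comparing.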
08-23 09:34