Problem description
I have found many posts proposing solutions for reading PDFs. I want to read a PDF file word by word and do some processing on it. People suggest pdfMiner, which converts the entire PDF file into a text file, but what I want is to read the PDF word by word. Can anyone suggest a library that does this?
Recommended answer
Possibly the fastest way to do this is to first convert your PDF into a text file using pdftotext (pdfMiner's site states that pdfMiner is 20 times slower than pdftotext) and afterwards parse the text file as usual.
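A minimal sketch of that approach in Python, assuming the pdftotext command-line tool is installed and on your PATH. The helper names `pdf_to_text` and `iter_words` are made up for this illustration, and the sketch assumes the tool emits UTF-8 (undecodable bytes are replaced rather than raising an error). Passing `-` as the output file asks pdftotext to write to standard output instead of a file.

```python
import re
import subprocess
from typing import Iterator

# A "word" here is simply a run of non-whitespace characters (an assumption;
# adjust the pattern if you need to strip punctuation).
WORD_RE = re.compile(r"\S+")


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path and return the extracted text."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Assumes UTF-8 output; bad bytes are replaced instead of raising.
    return result.stdout.decode("utf-8", errors="replace")


def iter_words(text: str) -> Iterator[str]:
    """Yield the words of text one at a time, in reading order."""
    for match in WORD_RE.finditer(text):
        yield match.group()
```

With these in place, something like `for word in iter_words(pdf_to_text("example.pdf")): ...` (where `example.pdf` is a placeholder path) would let you process the document one word at a time. Note that the word order depends on how pdftotext lays out the text, which may not match the visual reading order for multi-column documents.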
Also, when you said "I want to read a pdf file word by word and do some processing on it", you didn't specify whether you want to do processing based on the words in a PDF file or actually want to modify the PDF file itself. If it's the second case, then you've got an entirely different problem on your hands.