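The text in the PDF file is real text, not a scanned image. PDFMiner does not support Python 3; is there another way to extract it?
Best answer
There is also the pdfminer2 fork, which supports Python 3.4 and can be installed with pip3.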
https://github.com/metachris/pdfminer
This thread helped me patch a few things together.
from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO, BytesIO

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr

if __name__ == "__main__":
    #scrape = open("../warandpeace/chapter1.pdf", 'rb')  # for local files
    scrape = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")  # for external files
    pdfFile = BytesIO(scrape.read())
    outputString = readPDF(pdfFile)
    print(outputString)
    pdfFile.close()
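The script above switches between a local file and a URL by commenting one line out. As a rough sketch of how that choice could be made at runtime, the helper below loads either kind of source into a BytesIO buffer and hands it to a reader function such as readPDF. The names load_pdf_bytes and extract_text are made up for this illustration and are not part of pdfminer; only the standard library is used.

```python
from io import BytesIO
from urllib.parse import urlparse
from urllib.request import urlopen


def load_pdf_bytes(source):
    """Return a BytesIO holding the PDF at `source` (an http(s) URL or a local path)."""
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        # Remote file: download the whole body into memory.
        with urlopen(source) as response:
            data = response.read()
    else:
        # Anything else is treated as a local path (hypothetical convention).
        with open(source, "rb") as handle:
            data = handle.read()
    return BytesIO(data)


def extract_text(source, reader):
    """Load `source` and pass the buffer to `reader`, e.g. the readPDF function above."""
    pdf_file = load_pdf_bytes(source)
    try:
        return reader(pdf_file)
    finally:
        # Close the buffer even if the reader raises.
        pdf_file.close()
```

With readPDF defined as above, a call would look like extract_text("../warandpeace/chapter1.pdf", readPDF) for a local file, or the same call with the URL for a remote one.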
On extracting PDF text with Python 3.4, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31023793/