Problem description
I have followed a few tutorials, but I cannot get this code block to run. I made the necessary switch from StringIO to BytesIO (I believe).
I am unsure why 'banana' prints nothing; I think the errors might be red herrings. Does it have something to do with my following a Python 2.7 tutorial and trying to translate it to Python 3?
Errors:
  File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 28, in <module>
    banana = convert("A1.pdf")
  File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 19, in convert
    infile = file(fname, 'rb')
NameError: name 'file' is not defined
Script
from io import BytesIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = BytesIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close
    return text

banana = convert("A1.pdf")
print(banana)
The same thing happens with this variant:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

Banana = convert_pdf_to_txt("A1.pdf")
print(Banana)
I have tried searching for this (most of the pdfminer code is from this or this) but have had no luck.
Any insight is appreciated.
Cheers
Recommended answer
There is a solution for Python 3.5: you need pdfminer.six. Under Windows 10 I could easily install it with
pip install pdfminer.six
You can check the installed version with
import pdfminer
print(pdfminer.__version__)
I haven't tested it intensively yet, but I could run the following code for the pdf→text and pdf→html conversions.
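A minimal sketch of the pdf→text case follows. It is not the answer's original code but a Python 3 adaptation of the question's convert function, assuming a reasonably recent pdfminer.six is installed. The two changes that matter are open() in place of the Python 2 builtin file(), which causes the NameError above, and a StringIO buffer so the result is a str rather than bytes. The page_set helper is hypothetical and only mirrors the question's "if not pages" branch; the pdf→html case is left out because its buffer handling may differ between versions.

```python
from io import StringIO


def page_set(pages=None):
    """Hypothetical helper: turn an optional iterable of page numbers into a set.

    An empty set is treated by PDFPage.get_pages as "all pages".
    """
    return set(pages) if pages else set()


def convert(fname, pages=None):
    """Sketch: extract the text of a PDF with pdfminer.six on Python 3."""
    # Imported here so this module still loads when pdfminer.six is absent;
    # a missing package then raises ImportError on the first call instead.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    output = StringIO()  # text buffer, so getvalue() returns str
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # open() replaces file(), which no longer exists in Python 3.
    with open(fname, "rb") as infile:
        for page in PDFPage.get_pages(infile, page_set(pages)):
            interpreter.process_page(page)

    converter.close()
    text = output.getvalue()
    output.close()  # note the parentheses: a bare output.close does nothing
    return text
```

With this in place, print(convert("A1.pdf")) should print the extracted text, and convert("A1.pdf", pages=[0]) should restrict it to the first page, since page numbers are zero-based.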