本文介绍了pdfminer python 3.5的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经遵循了一些教程,但是我无法运行该代码块,我做了必要的从StringIO到BytesIO的切换(我相信吗?)

I have followed a few tutorials around but I am not able to get this code block to run, I did the necessary switches from StringIO to BytesIO (I believe?)

我不确定为什么香蕉"什么都不印刷,我认为错误可能是红色鲱鱼?遵循python2.7教程并尝试将其翻译为python3,这与我有关系吗?

I am unsure why 'banana' is printing nothing, I think the errors might be red herrings? is it something to do with me following a python2.7 tutorial and trying to translate it to python3?

errors: File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 28, in <module>
    banana = convert("A1.pdf")
  File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 19, in convert
    infile = file(fname, 'rb')
NameError: name 'file' is not defined

脚本

from io import BytesIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = BytesIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close
    return text

banana = convert("A1.pdf")
print(banana)

此变体也会发生相同的事情:

The same thing happens with this variant:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

Banana = convert_pdf_to_txt("A1.pdf")
print(Banana)

我尝试搜索此文件(大多数pdfminer代码来自)但没有运气.

I have tried searching for this (most of the pdfminer code is from this or this) but having no luck.

任何见识都会受到赞赏.

Any insight is appreciated.

欢呼

推荐答案

Python 3.5的解决方案:您需要 pdfminer.six .在 win10 下,我可以轻松安装

There is a solution for Python 3.5: you need pdfminer.six. Under win10 I could easy install it with

pip install pdfminer.six

您可以使用以下方法检查已安装的版本

You can check the installed version with

pdfminer.__version__

我还没有对它进行深入的测试.但是我可以为转换 pdf→text pdf→html

I haven't tested it intensively yet. But I could run the following code for the conversion pdf→text and pdf→html

这篇关于pdfminer python 3.5的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

07-31 14:02