How to unlock a "secured" (read-protected) PDF in Python?

Question

In Python I'm using pdfminer to read the text from a PDF with the code at the end of this question. I now get the following error message:

File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

When I open this PDF with Acrobat Pro, it turns out to be secured (or "read protected"). From this link, however, I read that there are many services that can disable this read protection easily (for example pdfunlock.com). Digging into the pdfminer source, I see that the error above is raised on these lines:

if check_extractable and not doc.is_extractable:
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)

Since many services can disable this read protection within a second, I presume it is really easy to do. It seems that .is_extractable is a simple attribute of the doc, but I don't think it is as simple as changing .is_extractable to True.

Does anybody know how I can disable the read protection on a PDF using Python? All tips are welcome!

================================================

Below is the code I currently use to extract the text from PDFs that are not read-protected.

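from cStringIO import StringIO  # Python 2, matching the traceback above

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def getTextFromPDF(rawFile):
    # rawFile holds the raw bytes of the PDF file
    resourceManager = PDFResourceManager(caching=True)
    outfp = StringIO()
    device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
    interpreter = PDFPageInterpreter(resourceManager, device)

    fileData = StringIO()
    fileData.write(rawFile)
    # With check_extractable=True, get_pages raises PDFTextExtractionNotAllowed for protected files
    for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
        interpreter.process_page(page)
    fileData.close()
    device.close()

    result = outfp.getvalue()

    outfp.close()
    return result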

Recommended answer

As far as I know, in most cases the full content of the PDF is actually encrypted, using the password as the encryption key, so simply setting .is_extractable to True isn't going to help you.

As per this thread:

Is there a library for removing passwords from PDFs programmatically?

I would recommend removing the read protection with a command-line tool such as qpdf (easily installable; on Ubuntu, for example, use apt-get install qpdf if you don't have it already):

qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf

Then open the unlocked file with pdfminer and do your stuff.
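
If you would rather drive qpdf from Python than from a shell, something along these lines should work. It is only a minimal sketch: it assumes Python 3 and that qpdf is on your PATH, and the function names build_qpdf_command and decrypt_with_qpdf are made up for this example. The arguments are exactly those of the command line above.

```python
import subprocess


def build_qpdf_command(password, secured_path, unsecured_path):
    """Return the argument list for the qpdf command line shown above.

    Hypothetical helper name; only the qpdf arguments come from the answer.
    """
    return [
        "qpdf",
        "--password=" + password,
        "--decrypt",
        secured_path,
        unsecured_path,
    ]


def decrypt_with_qpdf(password, secured_path, unsecured_path):
    """Run qpdf and raise CalledProcessError on any non-zero exit status."""
    # Passing a list (and no shell=True) avoids shell quoting problems
    # when the password or file names contain special characters.
    command = build_qpdf_command(password, secured_path, unsecured_path)
    subprocess.run(command, check=True)
```

Calling decrypt_with_qpdf("PASSWORD", "SECURED.pdf", "UNSECURED.pdf") would then write UNSECURED.pdf, which you can hand to pdfminer as before.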

For a pure-Python solution, you can try PyPDF2 and its .decrypt() method, but it doesn't work with all types of encryption, so really, you're better off just using qpdf; see:

https://github.com/mstamy2/PyPDF2/issues/53
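
For completeness, a rough sketch of the PyPDF2 route might look like the following. It assumes the older PyPDF2 1.x API (PdfFileReader, PdfFileWriter), which later releases renamed or removed, and the function name decrypt_with_pypdf2 is made up for this example. As noted above, .decrypt() may fail on encryption types that PyPDF2 does not support, so treat this as something to try rather than a reliable solution.

```python
def decrypt_with_pypdf2(password, secured_path, unsecured_path):
    """Copy all pages of an encrypted PDF into a new, unencrypted file.

    Hypothetical helper; assumes the PyPDF2 1.x API is installed.
    """
    # Imported here so that merely defining the function does not
    # require PyPDF2 to be installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(secured_path, "rb") as src:
        reader = PdfFileReader(src)
        if reader.isEncrypted:
            # decrypt() returns 0 when the password does not match.
            if reader.decrypt(password) == 0:
                raise ValueError("password did not match")

        writer = PdfFileWriter()
        for index in range(reader.getNumPages()):
            writer.addPage(reader.getPage(index))

        # Write while the source file is still open, because pages are
        # read lazily from it.
        with open(unsecured_path, "wb") as dst:
            writer.write(dst)
```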


09-05 10:06