When I process a PDF file (2.pdf) with pdfminer (pdf2txt.py), I get the following error:

pdf2txt.py 2.pdf

Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/local/bin/pdf2txt.py", line 109, in main
    interpreter.process_page(page)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
    self.init_resources(resources)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = self.get_font(None, subspec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
    font = PDFCIDFont(self, spec)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
    StringIO(self.fontfile.get_data()))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

A similar file (1.pdf), however, causes no problem.

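I can't find any information about this error. I opened an issue on the pdfminer GitHub repository, but it has not been answered yet. Can someone explain why this happens, and how I can parse 2.pdf?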

Update: after installing pdfminer directly from the GitHub repository, I get a similar error, with BytesIO in place of StringIO.
    $ pdf2txt.py 2.pdf
Traceback (most recent call last):
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
    interpreter.process_page(page)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
    self.init_resources(resources)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
    font = self.get_font(None, subspec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = PDFCIDFont(self, spec)
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
    BytesIO(self.fontfile.get_data()))
  File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
    (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16

Best answer

TL;DR

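Thanks to @mkl and @hynecker for the extra information. With that, I can confirm that this is a bug in both pdfminer and your PDF. Whenever pdfminer tries to fetch an embedded file stream (such as a font definition), it picks up the last one in the file before an endobj. Unfortunately, not all PDFs rigorously add the end tag, so pdfminer should be resilient to this.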

A quick fix for this problem

I have created a patch and submitted it as a pull request on GitHub. See https://github.com/euske/pdfminer/pull/159

Detailed diagnosis

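As mentioned in the other answers, you see this error because pdfminer does not get the expected number of bytes from the stream while it is unpacking the data. But why?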
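To make the symptom concrete, here is a minimal sketch (Python 3, standard library only) of what happens when the format string from the traceback, '>4sLLL', is given fewer than 16 bytes. The helper name read_entry and the sample bytes are made up for illustration and are not part of pdfminer.

```python
import struct
from io import BytesIO

# Format taken from the traceback: a 4-byte tag plus three unsigned 32-bit ints.
ENTRY_FORMAT = '>4sLLL'
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 16 bytes

def read_entry(fp):
    """Hypothetical helper: read one 16-byte entry from a file-like object."""
    return struct.unpack(ENTRY_FORMAT, fp.read(ENTRY_SIZE))

# A complete 16-byte entry unpacks fine.
complete = BytesIO(b'cmap' + struct.pack('>LLL', 1, 2, 3))
print(read_entry(complete))

# A stream that runs out early makes fp.read(16) return fewer bytes,
# and struct.unpack then raises struct.error.
truncated = BytesIO(b'cmap\x00\x00')
try:
    read_entry(truncated)
except struct.error as exc:
    print('struct.error:', exc)
```

The wording of the exception differs between Python 2 and Python 3, but the cause is the same: the read returned less data than the format requires.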

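As you can see in the stack trace, pdfminer (correctly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading a set of binary tables.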

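This does not work for 2.pdf because that stream contains text rather than binary font data. You can see this by running dumppdf -b -i 18 2.pdf. Here is how the output starts: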

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
>> def /CMapName /Adobe-Identity-UCS def
...
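The following sketch shows why text like this ends in a short read. It assumes the standard TrueType layout (a 4-byte font type, four unsigned 16-bit header fields starting with the table count, then one 16-byte entry per table) and is only loosely modeled on the frames in the traceback. The function name read_table_directory is hypothetical, not pdfminer's API, and the sample bytes are the first lines of the dump above.

```python
import struct
from io import BytesIO

def read_table_directory(data):
    """Hypothetical, simplified TrueType table-directory reader."""
    fp = BytesIO(data)
    fonttype = fp.read(4)
    # Header: numTables, searchRange, entrySelector, rangeShift.
    (ntables, _search, _selector, _shift) = struct.unpack('>HHHH', fp.read(8))
    tables = {}
    for _ in range(ntables):
        # Same format string as in the traceback.
        (tag, _checksum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
        tables[tag] = (offset, length)
    return fonttype, tables

# The start of the text stream, fed to the binary parser by mistake.
cmap_text = b"/CIDInit /ProcSet findresource begin\n12 dict begin\nbegincmap\n"

# Bytes 4-5 ("In") are misread as a table count of 18798, so the loop
# keeps consuming 16-byte entries until the data runs out.
print(struct.unpack('>H', cmap_text[4:6])[0])

try:
    read_table_directory(cmap_text)
except struct.error as exc:
    print('struct.error:', exc)
```

The parser itself is not at fault here. It is simply handed data that was never a font file.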

So, on to the question: is this a bug in your file or in pdfminer? Well, the fact that other readers can handle the file made me suspicious.

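Digging deeper, I found that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that the two cannot be the same.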

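Investigating the code further, I found that all streams were getting the same data. Oops! That is the bug. The cause appears to be that this PDF is missing some end tags, as @hynecker pointed out.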
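A symptom like this can be checked mechanically. The sketch below assumes the decoded bytes of each stream have already been collected into a dict keyed by object ID (for example, by saving dumppdf output per object). The function name find_duplicate_streams and the sample data are placeholders for illustration, not real contents of 2.pdf.

```python
import hashlib
from collections import defaultdict

def find_duplicate_streams(streams):
    """Group object IDs whose stream bytes are byte-for-byte identical.

    `streams` maps object ID -> decoded stream data (bytes).
    Returns a list of ID groups, one per set of identical streams.
    """
    by_digest = defaultdict(list)
    for objid, data in sorted(streams.items()):
        by_digest[hashlib.sha256(data).hexdigest()].append(objid)
    return [ids for ids in by_digest.values() if len(ids) > 1]

# Placeholder data: objects 17 and 18 share the same bytes, 19 does not.
sample = {
    17: b'/CIDInit /ProcSet findresource begin',
    18: b'/CIDInit /ProcSet findresource begin',
    19: b'\x00\x01\x00\x00',
}
print(find_duplicate_streams(sample))  # [[17, 18]]
```

A font file stream and a ToUnicode cmap landing in the same group, as with streams 17 and 18 here, is a strong hint that the parser is returning the wrong data rather than that the font itself is corrupt.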

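The fix is to return the right data for each stream. Any other fix that simply swallows the error would leave the wrong data in use for all streams, which would lead to, for example, incorrect font definitions.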

I believe the attached patch will fix your problem and should be safe to use in general.

Source: the Stack Overflow question "python - struct.error: unpack requires a string argument of length 16", https://stackoverflow.com/questions/40158637/
