While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py), I received the following error:

    $ pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/usr/local/bin/pdf2txt.py", line 115, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/usr/local/bin/pdf2txt.py", line 109, in main
        interpreter.process_page(page)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
        self.init_resources(resources)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
        font = self.get_font(None, subspec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
        font = PDFCIDFont(self, spec)
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
        StringIO(self.fontfile.get_data()))
      File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16

A similar file (1.pdf) doesn't cause a problem.

I can't find any information about this error. I opened an issue on the pdfminer GitHub repository, but it has gone unanswered. Can someone explain why this is happening? What can I do to parse 2.pdf?

Update: I get a similar error, with BytesIO instead of StringIO, after installing pdfminer directly from the GitHub repository.

    $ pdf2txt.py 2.pdf
    Traceback (most recent call last):
      File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
        interpreter.process_page(page)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
        self.init_resources(resources)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
        font = self.get_font(None, subspec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
        font = PDFCIDFont(self, spec)
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
        BytesIO(self.fontfile.get_data()))
      File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
        (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
    struct.error: unpack requires a string argument of length 16

Solution

TL;DR

Thanks to @mkl and @hynecker for the extra info. With that, I can confirm this is a bug in both pdfminer and your PDF. Whenever pdfminer tries to get embedded file streams (e.g. font definitions), it picks up the last one in the file before an endobj.
Sadly, not all PDFs rigorously add the end tag, so pdfminer should be resilient to this.

Quick fix for this issue

I've created a patch, which has been submitted as a pull request on GitHub. See https://github.com/euske/pdfminer/pull/159.

Detailed diagnosis

As mentioned in the other answers, the reason you're seeing this is that pdfminer is not getting the expected number of bytes from the stream as it unpacks the data. But why?

As you can see in your stack trace, pdfminer (rightly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading out a set of binary tables.

This doesn't work for 2.pdf because it has a text stream. You can see this by running dumppdf -b -i 18 2.pdf. I've put the start here:

    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0>> def
    /CMapName /Adobe-Identity-UCS def
    ...

So, garbage in, garbage out. Is this a bug in your file or in pdfminer? Well, the fact that other readers can handle it made me suspicious.

Digging around a little more, I see that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that these cannot be the same.

Digging into the code further, I see that all streams are getting the same data. Oops! This is the bug. The cause appears to be related to the fact that this PDF is missing some end tags, as noted by @hynecker.

The fix is to return the right data for each stream. Any other fix that just swallows the error will result in bad data being used for all streams and so, for example, in incorrect font definitions.

I believe the attached patch will fix your problem and should be safe to use in general.
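
To make the diagnosis a bit more concrete: the format string in the traceback, '>4sLLL', matches a 16-byte TrueType table record (tag, checksum, offset, length). The sketch below is not pdfminer's code. It is a minimal stand-alone reader, assuming Python 3, with hypothetical names (looks_like_sfnt, read_table_directory, tiny_font, cmap_text), and it shows why CMap text fed into such a reader runs off the end of the data.

```python
import struct
from io import BytesIO

# Magic values that can open an sfnt-style font file (TrueType/OpenType).
SFNT_MAGICS = (b"\x00\x01\x00\x00", b"true", b"OTTO", b"ttcf")


def looks_like_sfnt(data):
    """Return True if the data starts with a known sfnt magic value."""
    return data[:4] in SFNT_MAGICS


def read_table_directory(data):
    """Read the table directory of a single-font sfnt file.

    Returns {tag: (offset, length)}. Raises ValueError, rather than
    struct.error, when the data ends before the directory is complete.
    """
    fp = BytesIO(data)
    fp.read(4)  # sfnt version, not validated here
    header = fp.read(8)
    if len(header) < 8:
        raise ValueError("data too short for an sfnt header")
    # Fields: numTables, searchRange, entrySelector, rangeShift.
    num_tables, _, _, _ = struct.unpack(">HHHH", header)
    tables = {}
    for _ in range(num_tables):
        record = fp.read(16)
        if len(record) < 16:
            # Without this check, the short read surfaces as struct.error.
            raise ValueError("truncated table directory")
        tag, _checksum, offset, length = struct.unpack(">4sLLL", record)
        tables[tag] = (offset, length)
    return tables


# Hand-built header with one empty table record, for illustration only.
tiny_font = (b"\x00\x01\x00\x00"
             + struct.pack(">HHHH", 1, 16, 0, 0)
             + struct.pack(">4sLLL", b"cmap", 0, 28, 0))

# Start of the text stream quoted above (the line breaks are assumed).
cmap_text = b"/CIDInit /ProcSet findresource begin\n12 dict begin\nbegincmap\n"

print(looks_like_sfnt(tiny_font))
print(looks_like_sfnt(cmap_text))
print(read_table_directory(tiny_font))
```

With text as input, the bytes that should hold the table count decode to a large, meaningless number, so the loop keeps reading 16-byte records until the data runs out. A guard like looks_like_sfnt would only turn the crash into a cleaner error; as noted above, it would not fix the underlying problem of the wrong stream being returned.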
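
The observation that streams 17 and 18 are byte-for-byte identical suggests a quick sanity check for this kind of bug: collect the data returned for each stream and look for object IDs that share the same bytes. The helper below is a hypothetical sketch, assuming Python 3. How the bytes are obtained (for example through pdfminer or dumppdf) is left out, and the sample payloads are made up.

```python
import hashlib
from collections import defaultdict


def find_duplicate_streams(streams):
    """Group object IDs whose stream bytes are identical.

    `streams` maps an object ID to the bytes returned for that stream.
    Returns a list of ID groups, one per payload shared by two or more IDs.
    """
    by_digest = defaultdict(list)
    for objid, data in sorted(streams.items()):
        by_digest[hashlib.sha256(data).hexdigest()].append(objid)
    return [ids for ids in by_digest.values() if len(ids) > 1]


# Made-up payloads: 17 and 18 share the same bytes, 19 does not.
sample = {
    17: b"/CIDInit /ProcSet findresource begin",
    18: b"/CIDInit /ProcSet findresource begin",
    19: b"\x00\x01\x00\x00 some other data",
}
duplicates = find_duplicate_streams(sample)
print(duplicates)
```

Identical streams are not always an error, so a hit is only a hint. It becomes suspicious when, as here, a font file and a ToUnicode cmap come back with the same content.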