How do I fix a "UnicodeDecodeError" when extracting text with pdfminer.six?
Problem description
I get a UnicodeDecodeError when using pdfminer.six (the latest version from git, installed via pip install git+https://github.com/pdfminer/pdfminer.six.git):
Traceback (most recent call last):
  File "pdfminer_sample3.py", line 34, in <module>
    print(convert_pdf_to_txt("samples/numbers-test-document.pdf"))
  File "pdfminer_sample3.py", line 27, in convert_pdf_to_txt
    text = retstr.getvalue()
  File "/usr/lib/python2.7/StringIO.py", line 271, in getvalue
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
How can I fix this? The script is:
#!/usr/bin/env python
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from StringIO import StringIO
import codecs

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

print(convert_pdf_to_txt("samples/numbers-test-document.pdf"))
Sample PDF:
https://www.dropbox.com/s/khjfr63o82fa5yn/numbers-test-document.pdf?dl=0
Recommended answer
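Replace
    from StringIO import StringIO
with
    from io import BytesIO
and replace
    retstr = StringIO()
with
    retstr = BytesIO()

Python 2's StringIO.StringIO accepts both unicode objects and 8-bit strings, but getvalue() raises a UnicodeError when the buffer holds a mix of the two and one of the byte strings contains non-ASCII bytes. That is the ''.join failure on byte 0xe2 shown in the traceback. A BytesIO buffer holds only bytes, so getvalue() returns the encoded text (UTF-8 here, per the codec setting); decode it if you need a unicode string.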
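As a rough illustration, the same fix can be sketched for Python 3. It assumes a recent pdfminer.six release in which TextConverter writes UTF-8 bytes when handed a binary buffer. The helper collect_encoded_text is hypothetical and not part of pdfminer.six; it only mimics a writer that emits encoded bytes, so the buffer handling can be tried without a PDF. The pdfminer imports are placed inside convert_pdf_to_txt for the same reason.

```python
import io


def collect_encoded_text(chunks, codec="utf-8"):
    """Hypothetical helper: write encoded chunks to a bytes buffer, then decode once."""
    buf = io.BytesIO()
    for chunk in chunks:
        # The buffer only ever receives bytes, so nothing is implicitly decoded as ASCII.
        buf.write(chunk.encode(codec))
    return buf.getvalue().decode(codec)


def convert_pdf_to_txt(path):
    """Extract the text of a PDF, collecting pdfminer's output in a BytesIO buffer."""
    # Imported here (an assumption made for this sketch) so that the helper above
    # can be run on a machine without pdfminer.six installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    retstr = io.BytesIO()  # bytes buffer instead of StringIO
    device = TextConverter(rsrcmgr, retstr, codec="utf-8", laparams=LAParams())
    try:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        with open(path, "rb") as fp:  # open() replaces the Python 2 file() builtin
            for page in PDFPage.get_pages(fp, set(), maxpages=0, password="",
                                          caching=True, check_extractable=True):
                interpreter.process_page(page)
        # Decode once at the end to obtain a text string.
        return retstr.getvalue().decode("utf-8")
    finally:
        device.close()
        retstr.close()


if __name__ == "__main__":
    # U+2013 (en dash) encodes to UTF-8 bytes starting with 0xe2,
    # the same leading byte reported in the traceback.
    print(collect_encoded_text(["1 \u2013 2", "\n"]), end="")
```

With pdfminer.six installed, calling convert_pdf_to_txt("samples/numbers-test-document.pdf") should return the document text as a str. Decoding once after all pages are processed keeps the buffer purely binary, which is what avoids the mixed-type join that failed in the original script.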