Python: UnicodeDecodeError: 'utf8' codec can't decode byte

Problem description

I'm reading a bunch of RTF files into Python strings. On some texts, I get this error:

Traceback (most recent call last):
  File "11.08.py", line 47, in <module>
    X = vectorizer.fit_transform(texts)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 716, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 398, in fit_transform
    term_count_current = Counter(analyze(doc))
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 313, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 224, in decode
    doc = doc.decode(self.charset, self.charset_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 462: invalid start byte

I've tried:

Copying and pasting the text of the files to new files
Saving the RTF files as TXT files
Opening the TXT files in Notepad++, choosing 'convert to utf-8', and also setting the encoding to utf-8
Opening the files with Microsoft Word and saving them as new files

Nothing works. Any ideas?

It's probably not related, but here's the code in case you are wondering:

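from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter
from sklearn.feature_extraction.text import TfidfVectorizer

# Excerpt: dir, location, and texts are defined elsewhere in the script.
f = open(dir+location, "r")
doc = Rtf15Reader.read(f)
t = PlaintextWriter.write(doc).getvalue()
texts.append(t)
f.close()
# Vectorize the collected texts; this is the call that raises the error.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(texts)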

Solution

As I said on the mailing list, it is probably easiest to use the charset_error option and set it to ignore. If the file is actually UTF-16, you can also set the charset to utf-16 in the Vectorizer. See the docs.
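
To see what the ignore setting does at the byte level, here is a minimal sketch that uses only the standard library. The sample bytes and the helper name first_bad_byte are made up for illustration. Byte 0x92 is the right single quotation mark in Windows-1252, so that encoding is one plausible origin of the offending byte, but this is an assumption: the question does not confirm how the files were encoded.

```python
# Hypothetical sample: "don't stop" with a Windows-1252 curly apostrophe (0x92).
raw = b"don\x92t stop"


def first_bad_byte(data, encoding="utf-8"):
    """Return (position, byte value) of the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the offending byte in the input.
        return exc.start, bytearray(data)[exc.start]
    return None


# Strict UTF-8 decoding fails on 0x92, as in the traceback.
bad = first_bad_byte(raw)

# errors="ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", "ignore")

# errors="replace" keeps a visible U+FFFD marker instead.
replaced = raw.decode("utf-8", "replace")

# If the data really is Windows-1252 (an assumption), that codec keeps the character.
as_cp1252 = raw.decode("cp1252")
```

In the vectorizer, the first option corresponds to passing charset_error='ignore', the parameter visible in the traceback; later scikit-learn releases name these options decode_error and encoding. Ignoring errors loses the affected characters, so decoding with the correct charset is preferable when it is known.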
