Problem description
I am trying to read a file and convert its contents to a UTF-8 string, in order to remove some non-UTF-8 characters from the file string:
file_str = open(file_path, 'r').read()
file_str = file_str.decode('utf-8')
but I got the following error:
AttributeError: 'str' object has no attribute 'decode'
Update: I tried the code suggested by the answer:
file_str = open(file_path, 'r', encoding='utf-8').read()
but it didn't eliminate the non-UTF-8 characters, so how can I remove them?
Remove the .decode('utf8') call. Your file data has already been decoded, because in Python 3 an open() call in text mode (the default) returns a file object that decodes the data to Unicode strings for you.
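To make the difference between text mode and binary mode concrete, here is a minimal sketch. The temporary file and the sample text "café" are made up for illustration; only open() and the standard library are used.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary location (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("café".encode("utf-8"))
    file_path = tmp.name

# Text mode: Python decodes the bytes for you, so read() returns str.
with open(file_path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: read() returns raw bytes, and bytes do have a .decode() method.
with open(file_path, "rb") as f:
    raw = f.read()

os.remove(file_path)

print(type(text).__name__, hasattr(text, "decode"))     # str False
print(type(raw).__name__, raw.decode("utf-8") == text)  # bytes True
```

This is why the AttributeError appears: the str returned in text mode has no decode() method, while the bytes returned in binary mode do.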
You probably do want to add the encoding to the open() call to make this explicit. Otherwise Python uses a system default, and that may not be UTF-8:
file_str = open(file_path, 'r', encoding='utf8').read()
For example, on Windows, the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you try to read the text; you'd find you have mojibake, because the UTF-8 data was decoded using CP1252 or a similar 8-bit codec.
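The mojibake effect can be reproduced without any file at all. The following sketch uses a made-up sample string and simply decodes its UTF-8 bytes with the wrong codec:

```python
original = "café"
utf8_bytes = original.encode("utf-8")  # b'caf\xc3\xa9'

# Decoding UTF-8 bytes as CP1252 does not raise here; it quietly yields the wrong text.
misread = utf8_bytes.decode("cp1252")
print(misread)  # cafÃ©

# Decoding with the codec the data was actually written in gives the original back.
correct = utf8_bytes.decode("utf-8")
print(correct == original)  # True
```

Because the wrong codec often succeeds silently, the problem usually shows up only when someone looks at the resulting text.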
See the open() function documentation for further details.
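As for the follow-up in the question's update, one possible approach is the errors parameter of open(). This is a sketch that assumes "non-UTF-8 characters" means bytes that cannot be decoded as UTF-8; the sample bytes and the temporary file are invented for illustration.

```python
import os
import tempfile

# Hypothetical sample: ASCII text with two stray bytes (0xff, 0xfe) that are not valid UTF-8.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"abc\xff\xfedef")
    file_path = tmp.name

# errors="ignore" silently drops bytes that cannot be decoded as UTF-8.
with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
    dropped = f.read()

# errors="replace" substitutes U+FFFD for each undecodable byte instead.
with open(file_path, "r", encoding="utf-8", errors="replace") as f:
    replaced = f.read()

os.remove(file_path)

print(dropped)         # abcdef
print(repr(replaced))  # 'abc\ufffd\ufffddef'
```

With the default errors setting, the same read would raise UnicodeDecodeError. If the file decodes cleanly and the unwanted characters are valid Unicode, this will not remove them, and the string would have to be filtered after reading.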