Reading a file and trying to remove all non-UTF-8 characters


Problem description

I am trying to read a file and convert its contents to a UTF-8 string, so that I can remove some non-UTF-8 characters from it:

file_str = open(file_path, 'r').read()
file_str = file_str.decode('utf-8')

but I got the following error:

AttributeError: 'str' object has no attribute 'decode'

Update: I tried the code suggested by the answer,

file_str = open(file_path, 'r', encoding='utf-8').read()

but it didn't eliminate the non-UTF-8 characters, so how can I remove them?

Solution

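Remove the .decode('utf-8') call. Your file data has already been decoded: in Python 3, calling open() in text mode (the default) returns a file object that decodes the data to Unicode strings for you.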
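To make the str versus bytes distinction concrete, here is a minimal sketch. The temporary file and its sample contents are assumptions made for illustration; they are not part of the original question.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary location (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("naïve café".encode("utf-8"))
    file_path = tmp.name

# Text mode: open() decodes the bytes for us, so read() returns str.
with open(file_path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: read() returns raw bytes, the type that actually has .decode().
with open(file_path, "rb") as f:
    raw = f.read()

os.remove(file_path)

# str has no .decode() in Python 3, which is what the AttributeError reports.
text_has_decode = hasattr(text, "decode")

# Decoding the raw bytes by hand gives the same string as text mode did.
decoded = raw.decode("utf-8")
```

If you do want to call .decode() yourself, opening the file with 'rb' is the mode that gives you something to decode.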

You probably do want to add the encoding to the open() call to make it explicit. Otherwise Python uses a system default, which may not be UTF-8:

file_str = open(file_path, 'r', encoding='utf8').read()

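For example, on Windows the default codec is almost certainly wrong for UTF-8 data, but you won't see the problem until you try to read the text; you'd find you have Mojibake, because the UTF-8 data was decoded using CP1252 or a similar 8-bit codec.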
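The effect can be reproduced without touching the file system. The sketch below uses an arbitrary sample string (an assumption for illustration) and decodes its UTF-8 bytes with CP1252, as a mismatched default codec would:

```python
# Arbitrary sample text containing one non-ASCII character.
original = "café"

# These are the bytes a UTF-8 file would contain.
utf8_bytes = original.encode("utf-8")

# Decoding them with the wrong 8-bit codec raises no error,
# it just silently yields the wrong characters.
mojibake = utf8_bytes.decode("cp1252")

# Undoing the wrong step recovers the text, as long as nothing was lost.
repaired = mojibake.encode("cp1252").decode("utf-8")
```

Here mojibake is 'cafÃ©': the two bytes that encode 'é' in UTF-8 are each read as a separate CP1252 character.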

See the open() function documentation for further details.
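The answer above does not cover the follow-up question of how to actually drop bytes that are not valid UTF-8. One possible approach, offered as a sketch rather than as the answer's own method, is the errors parameter of open(): 'ignore' discards undecodable bytes and 'replace' substitutes U+FFFD for them. The temporary file and its deliberately invalid bytes are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical sample: ASCII text with two bytes (0xFF, 0xFE) that can never
# appear in valid UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"abc\xffdef\xfeghi")
    file_path = tmp.name

# errors="ignore" silently drops every byte that cannot be decoded.
with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
    cleaned = f.read()

# errors="replace" keeps a visible marker (U+FFFD) where data was invalid.
with open(file_path, "r", encoding="utf-8", errors="replace") as f:
    marked = f.read()

os.remove(file_path)
```

Note that 'ignore' loses data without any warning, so 'replace' may be the safer choice when you need to see where the file was damaged. Neither option helps if the file is really in another encoding; in that case decoding with the correct codec is the fix.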

