Handling Unicode errors with Python 3's readlines()

Question

I keep getting this error while reading a text file. Is it possible to handle or ignore it and proceed?

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7827: character maps to <undefined>

Recommended answer

In Python 3, pass an appropriate errors= value (such as errors='ignore' or errors='replace') when creating your file object (presuming it to be a subclass of io.TextIOWrapper; if it isn't, consider wrapping it in one). Also consider passing a more likely encoding than charmap (when you aren't sure, utf-8 is always a good place to start).

For example:

f = open('misc-notes.txt', encoding='utf-8', errors='ignore')
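
To make the difference between the two errors= values concrete, here is a minimal sketch. The temporary file and its contents are made up for illustration; only open(), its encoding= and errors= parameters, and readlines() come from the answer above.

```python
import os
import tempfile

# Hypothetical sample data: 0x81 is not valid on its own in UTF-8.
raw = b"first line\nbad \x81 byte\nlast line\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # errors='replace' substitutes U+FFFD for each undecodable byte.
    with open(path, encoding="utf-8", errors="replace") as f:
        replaced = f.readlines()

    # errors='ignore' silently drops undecodable bytes.
    with open(path, encoding="utf-8", errors="ignore") as f:
        ignored = f.readlines()
finally:
    os.remove(path)

print(replaced)
print(ignored)
```

With errors='replace' the damaged spot stays visible in the output as a replacement character, while errors='ignore' leaves no trace that data was lost, so 'replace' is often the safer choice when debugging.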

In Python 2, the read() operation simply returns bytes; the trick, then, is decoding them to get a string (if you do, in fact, want characters rather than bytes). If you don't have a better guess for their real encoding:

your_string.decode('utf-8', 'replace')

...to replace unhandled characters, or

your_string.decode('utf-8', 'ignore')

...to simply ignore them.
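
The same decode call exists on bytes objects in Python 3, for instance on data read from a file opened in binary mode. A small sketch with made-up sample bytes shows what each error handler produces:

```python
# Hypothetical sample: valid UTF-8 for "café", then a stray 0x81 byte.
data = b"caf\xc3\xa9 \x81 ok"

# 'replace' marks the undecodable byte with U+FFFD.
replaced = data.decode("utf-8", "replace")

# 'ignore' drops the undecodable byte entirely.
ignored = data.decode("utf-8", "ignore")

# The default handler ('strict') raises instead of guessing.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(replaced))
print(repr(ignored))
print(strict_failed)
```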

That said, finding and using the file's real encoding (rather than guessing utf-8) would be preferred.
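
One rough way to narrow down the real encoding, when nothing else is known about the file, is to try a short list of candidates in strict mode and keep the first one that decodes cleanly. The helper name decode_with_candidates and the candidate list below are assumptions for illustration, not part of the original answer. A clean decode does not prove the guess is right: latin-1 accepts every byte value, so it only works as a last resort.

```python
def decode_with_candidates(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes.

    Hypothetical helper: the candidate order is a guess and should be
    adjusted to whatever is known about where the file came from.
    """
    for name in candidates:
        try:
            # Strict decoding, so a wrong guess raises instead of hiding errors.
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Valid UTF-8 is accepted by the first candidate.
text_a, used_a = decode_with_candidates(b"caf\xc3\xa9")

# These bytes are invalid as UTF-8, and 0x81 is undefined in cp1252,
# so the helper falls through to latin-1.
text_b, used_b = decode_with_candidates(b"na\xefve \x81")

print(used_a, used_b)
```

The decoded text should still be inspected by eye, since several single-byte encodings can decode the same bytes without error while producing different characters.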
