Why does chardet say my UTF-8-encoded string (originally decoded from ISO-8859-1) is ASCII?


Problem description

I'm trying to convert ASCII characters to UTF-8. The small example below still reports the result as ASCII:

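import chardet
# chunk is assumed to hold raw bytes read earlier; re-encode them from ISO-8859-1 to UTF-8
chunk = chunk.decode('ISO-8859-1').encode('UTF-8')
print(chardet.detect(chunk[0:2000]))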

It returns:

{'confidence': 1.0, 'encoding': 'ascii'}

How come?

Recommended answer

From Python's documentation:

It can handle any Unicode code point.

A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.

A string of ASCII text is also valid UTF-8 text.

In other words, all ASCII text is also valid UTF-8 text (UTF-8 is a superset of ASCII).
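The zero-byte and byte-order points in the quoted passage can be checked with a short Python 3 sketch. The sample text is an arbitrary choice, and UTF-16 is brought in only as a contrast; it is not part of the original answer.

```python
# Python 3 sketch; the sample text is arbitrary.
text = "chardet"

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")  # little-endian code units
utf16_be = text.encode("utf-16-be")  # big-endian code units

# UTF-8 works byte by byte, so there is only one possible byte order,
# and ASCII characters never produce a zero byte.
utf8_has_zero = 0 in utf8

# UTF-16 stores each ASCII character as two bytes, one of them zero,
# and the two byte orders give different byte strings.
utf16_has_zero = 0 in utf16_le
orders_differ = utf16_le != utf16_be

print(utf8_has_zero, utf16_has_zero, orders_differ)  # False True True
```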

To make the superset relationship concrete, check out this console session:

>>> s = 'test'
>>> s.encode('ascii') == s.encode('utf-8')
True
>>>

However, not every UTF-8-encoded string is a valid ASCII string:

>>> foreign_string = u"éâô"
>>> foreign_string.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\xb4'
>>> foreign_string.encode('ascii')  # Fails: these characters cannot be represented in ASCII

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    foreign_string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>>
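The session above is from Python 2. As a rough Python 3 equivalent (a sketch, not part of the original answer), the same experiment can be written as a script. The characters are spelled with `\u` escapes so the result does not depend on the source file's encoding, and the flag name `ascii_ok` is made up for illustration.

```python
# Python 3 sketch of the console session above.
foreign_string = "\u00e9\u00e2\u00f4"  # the characters e-acute, a-circumflex, o-circumflex

# Each of these characters needs two bytes in UTF-8.
utf8_bytes = foreign_string.encode("utf-8")
print(utf8_bytes)  # b'\xc3\xa9\xc3\xa2\xc3\xb4'

# ASCII only covers code points 0 to 127, so encoding raises an error.
try:
    foreign_string.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii_ok)  # False
```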

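So chardet is still right. Only when the data contains at least one non-ASCII character can chardet tell that it is not ASCII-encoded. The reported result suggests that the first 2000 bytes of the chunk contain only ASCII characters, and those have the same byte values in ISO-8859-1, UTF-8, and ASCII.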
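To see why the conversion in the question changes nothing for ASCII-only input, here is a small Python 3 sketch that uses only the standard library. The helper names `is_ascii_only` and `latin1_to_utf8` and the sample byte strings are made up for illustration, and the sketch does not call chardet itself.

```python
def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte is below 0x80, i.e. the data is valid ASCII."""
    return all(b < 0x80 for b in data)


def latin1_to_utf8(data: bytes) -> bytes:
    """Same conversion as in the question: decode ISO-8859-1, encode UTF-8."""
    return data.decode("ISO-8859-1").encode("UTF-8")


plain = b"hello world"   # ASCII-only input (hypothetical sample)
accented = b"caf\xe9"    # "cafe" with an e-acute, in ISO-8859-1 (hypothetical sample)

plain_utf8 = latin1_to_utf8(plain)
accented_utf8 = latin1_to_utf8(accented)

# ASCII-only bytes come out unchanged, so a detector sees nothing but ASCII.
print(plain_utf8 == plain, is_ascii_only(plain_utf8))  # True True

# A non-ASCII character becomes a multi-byte sequence with bytes >= 0x80.
print(accented_utf8, is_ascii_only(accented_utf8))  # b'caf\xc3\xa9' False
```

Only the second kind of input gives an encoding detector something other than ASCII to work with.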

Hope this simple explanation helps!
