Problem description
I'm trying to convert ASCII characters to UTF-8. The small example below still reports ASCII:
chunk = chunk.decode('ISO-8859-1').encode('UTF-8')
print chardet.detect(chunk[0:2000])
It returns:
{'confidence': 1.0, 'encoding': 'ascii'}
How come?
Recommended answer
From the Python documentation:
It can handle any Unicode code point.
A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes.
A string of ASCII text is also valid UTF-8 text.
All ASCII text is also valid UTF-8 text (UTF-8 is a superset of ASCII).
To make this clear, look at the following console session:
>>> s = 'test'
>>> s.encode('ascii') == s.encode('utf-8')
True
>>>
However, not every UTF-8 encoded string is a valid ASCII string:
>>> foreign_string = u"éâô"
>>> foreign_string.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\xb4'
>>> foreign_string.encode('ascii') #This won't work, since it's invalid in ASCII encoding
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    foreign_string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>>
So chardet is still right. Decoding pure-ASCII bytes as ISO-8859-1 and re-encoding them as UTF-8 produces exactly the same bytes, so there is nothing for it to distinguish. Only when the data contains at least one non-ASCII character can chardet tell that it is not ASCII-encoded.
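To see this for the conversion from the question, here is a minimal sketch. It is written for Python 3 (the sessions above are Python 2), and the helper names latin1_to_utf8 and is_pure_ascii as well as the sample byte strings are made up for illustration. It does not call chardet; it only checks whether any byte is 0x80 or higher, which is the evidence a detector would need to rule out ASCII.

```python
def latin1_to_utf8(chunk: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 (the same step as in the question)."""
    return chunk.decode("ISO-8859-1").encode("UTF-8")


def is_pure_ascii(data: bytes) -> bool:
    """Return True when every byte is below 0x80, i.e. the data is valid ASCII."""
    return all(byte < 0x80 for byte in data)


# Hypothetical sample inputs, not taken from the original question.
ascii_chunk = b"plain text"
latin1_chunk = b"caf\xe9"  # 0xE9 is 'é' in ISO-8859-1

ascii_result = latin1_to_utf8(ascii_chunk)
latin1_result = latin1_to_utf8(latin1_chunk)

# ASCII-only input comes out byte-for-byte identical, so it still looks like ASCII.
print(ascii_result == ascii_chunk, is_pure_ascii(ascii_result))

# Input with a non-ASCII character gains a multi-byte UTF-8 sequence.
print(latin1_result, is_pure_ascii(latin1_result))
```

If the chunk in the question contained only ASCII characters, the first case applies: the conversion ran, but it had nothing to change.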
Hope this simple explanation helps!