Problem description
I found something really weird about unicode. As I understand it, u"" + "string" produces a unicode object, so why do the two lengths below differ?
print len(u''+'New York\u200b')
14
print type(u''+'New York\u200b')
<type 'unicode'>
print len(u'New York\u200b')
9
print type(u'New York\u200b')
<type 'unicode'>
I also tried to get rid of the \u200b, which I believe is a Unicode character:
text = u'New York\u200b'
print text.encode('ascii', errors='ignore')
New York
text = u''+'New York\u200b'
print text.encode('ascii', errors='ignore')
New York\u200b
This also gives a different result, and I am really confused! By the way, I am using Python 2.7; is it time to switch to 3.3? Thanks in advance!
>>> (u''+'New York\u200b').encode('utf-8')
'New York\\u200b'
As you can see, since 'New York\u200b' is not a unicode literal, the \u escape has no special meaning and is interpreted literally, i.e. as the sequence of ASCII characters \, u, 2, 0, 0, b. Hence the string has length 14 (8 characters for New York plus those 6). The u'' + only converts the string to unicode (an implicit ASCII decode in Python 2); it does not cause the contents to be re-interpreted. Putting the u directly before the literal makes Python interpret \u200b as an escape, hence as a single character, hence that string has length 9.
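To make the difference concrete, here is a minimal sketch that runs on Python 3 (and also on 2.7). It assumes the byte string b'New York\\u200b' as a stand-in for the Python 2 literal 'New York\u200b', since both hold the same 14 bytes; the variable names are illustrative only. It additionally uses the standard unicode_escape codec to show what an explicit re-interpretation of the escape would look like.

```python
# Assumption: `raw` stands in for the Python 2 literal 'New York\u200b'.
raw = b'New York\\u200b'   # backslash, u, 2, 0, 0, b stored as six ASCII bytes
text = u'New York\u200b'   # here \u200b is one character (ZERO WIDTH SPACE)

# What u'' + raw does implicitly in Python 2: a plain ASCII decode.
as_unicode = raw.decode('ascii')

# Re-interpreting the escape has to be requested explicitly.
reinterpreted = raw.decode('unicode_escape')

print(len(as_unicode))        # 14
print(len(text))              # 9
print(len(reinterpreted))     # 9
print(reinterpreted == text)  # True
```

The ASCII decode changes only the type, never the contents, which is why the concatenated string keeps its 14 characters.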
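In your second example:
text = u''+'New York\u200b'
print text.encode('ascii', errors='ignore')
New York\u200b
Here .encode does not modify the characters in the string, because every one of them is already ASCII and there is nothing for errors='ignore' to drop; it only converts from unicode to str.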
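The effect of errors='ignore' on the two strings can be compared side by side. This is a sketch for Python 3 (it also runs on 2.7); as an assumption, the decoded byte string b'New York\\u200b' again plays the role of u'' + 'New York\u200b' from Python 2, and the variable names are made up for illustration.

```python
real_zwsp = u'New York\u200b'                        # 9 characters, last one is U+200B
literal_escape = b'New York\\u200b'.decode('ascii')  # 14 plain ASCII characters

# U+200B cannot be encoded as ASCII, so 'ignore' silently drops it.
stripped = real_zwsp.encode('ascii', errors='ignore')

# Every character here is ASCII already, so nothing is dropped.
unchanged = literal_escape.encode('ascii', errors='ignore')

print(len(stripped))   # 8
print(len(unchanged))  # 14
```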
It's probably clearer if you print the contents of the two strings:
>>> print(u'New York\u200b') # note: \u200b interpreted as unicode character
New York
>>> print(b'New York\u200b'.decode('ascii'))
New York\u200b
Or, if you prefer to see a visible Unicode character, try code point 9731 (U+2603, the snowman):
>>> print(u'New York\u2603')
New York☃
>>> print(b'New York\u2603'.decode('ascii'))
New York\u2603
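Since printing can hide invisible characters such as \u200b, another way to compare the two strings is to list the Unicode name of each trailing character. The sketch below uses the standard unicodedata module; the helper tail_names is hypothetical and exists only for this illustration, and the byte string stands in for the Python 2 non-unicode literal.

```python
import unicodedata

def tail_names(s, prefix_len=8):
    """Return the Unicode names of the characters after the 'New York' prefix."""
    return [unicodedata.name(ch) for ch in s[prefix_len:]]

escaped = u'New York\u2603'                   # \u2603 parsed as one character
literal = b'New York\\u2603'.decode('ascii')  # six separate ASCII characters

print(tail_names(escaped))
print(tail_names(literal))
```

The first list contains a single name, while the second contains six, one for each of the characters \, u, 2, 6, 0, 3.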