Different lengths in Python for the same unicode string

Problem description

I found something really weird about unicode. As I understand it, if I do u"" + "string", the type will be unicode, so why are the lengths different?

print len(u''+'New York\u200b')
14
print type(u''+'New York\u200b')
<type 'unicode'>
print len(u'New York\u200b')
9
print type(u'New York\u200b')
<type 'unicode'>

I also tried to get rid of \u200b, which I think is a unicode character:

text = u'New York\u200b'
print text.encode('ascii', errors='ignore')
New York
text = u''+'New York\u200b'
print text.encode('ascii', errors='ignore')
New York\u200b

I got different results here too, and I am really confused! By the way, I am using Python 2.7; is it time to change to 3.3? Thanks in advance!

Solution

>>> (u''+'New York\u200b').encode('utf-8')
'New York\\u200b'

As you can see, 'New York\u200b' is not a unicode string, so the \u escape has no special meaning and is taken literally, i.e. as the six ASCII characters \ u 2 0 0 b; hence the string has length 14. Concatenating with u'' only converts the string to unicode; it does not re-interpret the contents. Putting the u directly before the literal makes Python interpret \u200b as an escape, and therefore as a single character, so that string has length 9.
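The same difference can be reproduced in Python 3, where plain string literals are already unicode. The following is only a sketch under that assumption: the variable names are illustrative, and a bytes literal with an escaped backslash stands in for the Python 2 str from the question. It also shows the standard unicode_escape codec, which re-interprets the backslash sequence explicitly.

```python
# Python 3 sketch; names are illustrative and not taken from the original post.

# Python 2's plain 'New York\u200b' holds a literal backslash followed by u200b.
# A Python 3 bytes literal with an escaped backslash reproduces that content.
raw = b'New York\\u200b'
literal_text = raw.decode('ascii')   # same effect as u'' + 'New York\u200b' in Python 2
print(len(literal_text))             # 14

# In a text literal the \u200b escape becomes one ZERO WIDTH SPACE character.
real_text = 'New York\u200b'
print(len(real_text))                # 9

# The unicode_escape codec turns the backslash sequence into the real character.
reinterpreted = raw.decode('unicode_escape')
print(reinterpreted == real_text)    # True
```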

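In your second example:

text = u''+'New York\u200b'
print text.encode('ascii', errors='ignore')
New York\u200b

Here the .encode does not modify any characters in the string, because they are all ASCII already; it only converts from unicode to str.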
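To make the effect of errors='ignore' concrete, here is a small Python 3 sketch (an assumption on my part, since the question uses Python 2.7; the variable names are illustrative). It contrasts a string that really contains the zero-width space with one that only contains the six literal characters, and also shows str.replace as one possible way to strip just that character.

```python
# Python 3 sketch; names are illustrative.
real_text = 'New York\u200b'       # ends with one ZERO WIDTH SPACE (U+200B)
literal_text = 'New York\\u200b'   # ends with six ASCII characters: \ u 2 0 0 b

# errors='ignore' silently drops characters that ASCII cannot represent.
stripped = real_text.encode('ascii', errors='ignore')
print(stripped)                    # b'New York'

# Every character here is already ASCII, so nothing is dropped.
unchanged = literal_text.encode('ascii', errors='ignore')
print(unchanged)                   # b'New York\\u200b'

# Removing only the zero-width space leaves any other non-ASCII text intact.
cleaned = real_text.replace('\u200b', '')
print(len(cleaned))                # 8
```

Encoding to ASCII with errors='ignore' discards every non-ASCII character, not just \u200b, so the replace approach may be preferable when the text can contain other non-ASCII characters that should be kept.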

It's probably clearer if you print the contents of the two strings:

>>> print(u'New York\u200b')  # note: \u200b interpreted as unicode character
New York
>>> print(b'New York\u200b'.decode('ascii'))
New York\u200b

Or, if you prefer to see a visible unicode character, try code point 9731 (U+2603, the snowman):

>>> print(u'New York\u2603')
New York☃
>>> print(b'New York\u2603'.decode('ascii'))
New York\u2603
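When it is unclear whether a string holds a real character or just its escape text, inspecting the code points can help. This is a sketch for Python 3 using the standard unicodedata module; it is an addition for illustration, not part of the original answer.

```python
import unicodedata

snowman = '\u2603'
snowman_code = ord(snowman)
print(snowman_code)                  # 9731
print(hex(snowman_code))             # 0x2603
print(unicodedata.name(snowman))     # SNOWMAN

# The character from the question is invisible when printed, but it has a name.
zwsp_name = unicodedata.name('\u200b')
print(zwsp_name)                     # ZERO WIDTH SPACE
```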