Problem description
In Python 2, Unicode strings may contain both unicode and bytes:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I understand that this is absolutely not something one should write in one's own code, but this is a string that I have to deal with.
The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).
My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).
Encoding it to UTF-8 yields
>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'
Decoding that from UTF-8 then gives back the initial string with the bytes still in it, which is not good:
>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I found a hacky way to solve the problem, however:
>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!
This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?
Recommended answer
No, they may not. They contain Unicode characters.
Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
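A quick check of that claim, as a minimal sketch (the variable names are only illustrative). It also shows why encoding the original string to UTF-8 turned each \xd0 into the two bytes \xc3\x90 rather than leaving a single byte:

```python
# -*- coding: utf-8 -*-
# u'\xd0' is one Unicode character (U+00D0), not a raw byte.
ch = u'\xd0'

same = (ch == u'\u00d0')      # both escapes name the same character
code_point = ord(ch)          # its code point as an integer
encoded = ch.encode('utf-8')  # UTF-8 needs two bytes for U+00D0

print(same)
print(code_point)
print(repr(encoded))
```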
There is no way to look at the string and tell whether the \xd0 is supposed to be part of some UTF-8 encoded character or whether it actually stands for that Unicode character by itself.
However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 through unchanged.
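The approach described above could be sketched roughly as follows. This is only one possible reading of it, not a definitive implementation: fix_mixed is a hypothetical helper name, and the sketch assumes that every run of characters below 256 really is valid UTF-8 (otherwise decode raises UnicodeDecodeError). It is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
def fix_mixed(text):
    """Hypothetical helper: reinterpret characters < 256 as UTF-8 bytes.

    Assumption: every run of code points below 256 is valid UTF-8.
    Characters >= 256 are passed through unchanged.
    """
    out = []
    pending = bytearray()  # code points < 256 collected as raw bytes

    def flush():
        # Decode the collected bytes as UTF-8 and reset the buffer.
        if pending:
            out.append(pending.decode('utf-8'))
            del pending[:]

    for ch in text:
        cp = ord(ch)
        if cp < 256:
            pending.append(cp)
        else:
            flush()
            out.append(ch)
    flush()
    return u''.join(out)


a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
fixed = fix_mixed(a)
print(fixed == u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a')
```

Collecting consecutive low characters into one buffer before decoding matters because a single UTF-8 sequence spans several of them. Note that a string containing a genuine Latin-1 character such as u'\xe9' on its own would make this raise, which is the ambiguity described earlier.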