Question
I am working with Python 2.7.12. I have a string that contains UTF-encoded text but is of type str, not unicode. I would like to convert it to plain text. This example shows what I am trying to do.
>>> s
'\x00u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'
>>> print s
username
>>> type(s)
<type 'str'>
>>> s == "username"
False
How would I go about converting this string?
Recommended answer
That's not UTF-8, it's UTF-16, though it's unclear whether it's big endian or little endian: there is no BOM, and the string has both a leading and a trailing NUL byte, which gives it an odd length. For text in the ASCII range, UTF-8 is indistinguishable from ASCII, while UTF-16 alternates NUL bytes with the ASCII-encoded bytes (as in your example).
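To make the difference concrete, here is a small sketch that encodes the same ASCII-range text three ways. It uses `u''` and `b''` literals so that it should run on Python 2.7 as well as Python 3; the variable names are only illustrative.

```python
text = u'username'

utf8 = text.encode('utf-8')          # identical to plain ASCII for this text
utf16_le = text.encode('utf-16-le')  # low byte first: letter, then NUL
utf16_be = text.encode('utf-16-be')  # high byte first: NUL, then letter

print(repr(utf8))
print(repr(utf16_le))
print(repr(utf16_be))
```

The string in the question looks like the big-endian form with one extra trailing NUL, or equivalently the little-endian form with one extra leading NUL, which is why the byte order cannot be determined from the data alone.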
In any event, converting to plain ASCII is fairly easy; you just need to deal with the odd length one way or another:
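s = '\x00u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'  # original 17-byte string
# Option 1: drop the leading \x00 so the length is even, then decode as little endian
sascii = s[1:].decode('utf-16-le').encode('ascii')
# Option 2: keep the leading \x00, decode as big endian, and ignore the dangling trailing byte
sascii = s.decode('utf-16-be', errors='ignore').encode('ascii')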
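If such strings turn up regularly, the two approaches can be folded into one helper. The following is only a sketch: `utf16_to_ascii` is a hypothetical name, and the byte-order guess (a leading NUL means big endian) is a heuristic that assumes the payload is ASCII-range text with no BOM. It trims a dangling byte explicitly rather than relying on `errors='ignore'`, so malformed input still raises.

```python
def utf16_to_ascii(raw):
    """Decode NUL-interleaved ASCII bytes as UTF-16 and return ASCII bytes.

    Hypothetical helper; assumes ASCII-range text and no BOM.
    """
    # Heuristic: for ASCII text the high byte of every code unit is NUL,
    # so a leading NUL suggests big endian, otherwise little endian.
    codec = 'utf-16-be' if raw[:1] == b'\x00' else 'utf-16-le'
    # UTF-16 code units are two bytes wide; drop a dangling trailing byte.
    if len(raw) % 2:
        raw = raw[:-1]
    # Strict decode and encode: anything that is not ASCII-range UTF-16 raises.
    return raw.decode(codec).encode('ascii')


s = b'\x00u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'
result = utf16_to_ascii(s)
```

With the string from the question, `result` compares equal to `"username"` on Python 2.7 (and to `b"username"` on Python 3).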
Of course, if your inputs are just NUL-interspersed ASCII and you can't figure out the endianness or how to get an even number of bytes, you can just cheat:
sascii = s.replace('\x00', '')
But that won't raise an exception when the input is in some completely different encoding, so it may hide errors that explicitly specifying the expected encoding would have caught.
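A short sketch of that trade-off, using a made-up input that is Latin-1 rather than UTF-16 (the sample bytes and variable names are illustrative assumptions). The `replace` approach passes the bytes through untouched, while the explicit decode and encode fails loudly:

```python
raw = b'caf\xe9'  # Latin-1 bytes for "café", not UTF-16 at all

# Cheating: no NULs to strip, so the non-ASCII byte slips through silently.
stripped = raw.replace(b'\x00', b'')

# Stating the expected encoding: the bytes decode to non-ASCII code points,
# so the ASCII encode step raises instead of returning bad data.
try:
    raw.decode('utf-16-le').encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Here `stripped` still contains the `\xe9` byte, and `strict_failed` ends up `True`.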