Problem description
I need to step through a Python string one character at a time, but a simple "for" loop gives me UTF-16 code units instead:
str = "abc\u20ac\U00010302\U0010fffd"
for ch in str:
    code = ord(ch)
    print("U+{:04X}".format(code))
That prints:
U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD
when what I wanted was:
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
Is there any way to get Python to give me the sequence of Unicode code points, regardless of how the string is actually encoded under the hood? I'm testing on Windows here, but I need code that will work anywhere. It only needs to work on Python 3, I don't care about Python 2.x.
The best I've been able to come up with so far is this:
import codecs
str = "abc\u20ac\U00010302\U0010fffd"
bytestr, _ = codecs.getencoder("utf_32_be")(str)
for i in range(0, len(bytestr), 4):
    code = 0
    for b in bytestr[i:i + 4]:
        code = (code << 8) + b
    print("U+{:04X}".format(code))
But I'm hoping there's a simpler way.
(Pedantic nitpicking over precise Unicode terminology will be ruthlessly beaten over the head with a clue-by-four. I think I've made it clear what I'm after here, please don't waste space with "but UTF-16 is technically Unicode too" kind of arguments.)
On Python 3.2.1 with a narrow Unicode build, sys.maxunicode is 65535:
PythonWin 3.2.1 (default, Jul 10 2011, 21:51:15) [MSC v.1500 32 bit (Intel)] on win32.
Portions Copyright 1994-2008 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> import sys
>>> sys.maxunicode
65535
What you've discovered (UTF-16 encoding):
>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
8
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...
U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD
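To see where those four extra values come from, here is a small sketch of the standard UTF-16 surrogate arithmetic. The function name to_surrogates is made up for illustration and is not part of any library; it just shows how a code point above U+FFFF gets split into the high/low pair that a narrow build stores.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


for cp in (0x10302, 0x10FFFD):
    high, low = to_surrogates(cp)
    print("U+{:04X} -> U+{:04X} U+{:04X}".format(cp, high, low))
```

The two pairs it produces match the last four lines of the output above.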
A way around it:
>>> import struct
>>> s=s.encode('utf-32-be')
>>> struct.unpack('>{}L'.format(len(s)//4),s)
(97, 98, 99, 8364, 66306, 1114109)
>>> for i in struct.unpack('>{}L'.format(len(s)//4),s):
...     print('U+{:04X}'.format(i))
...
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
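If you want this as something reusable, the same idea can be wrapped in a function. This is only a sketch: code_points is a name I picked, and it assumes the string encodes cleanly to UTF-32. On Python 3.3 and later, a string containing unpaired surrogates raises UnicodeEncodeError at the encode step.

```python
import struct


def code_points(text):
    """Return the Unicode code points of text as a list of ints.

    Hypothetical helper built on the UTF-32 round trip shown above.
    """
    data = text.encode("utf-32-be")      # 4 bytes per code point, no BOM
    count = len(data) // 4
    # ">" selects big-endian standard sizes, so each "L" is exactly 4 bytes.
    return list(struct.unpack(">{}L".format(count), data))


for cp in code_points("abc\u20ac\U00010302\U0010fffd"):
    print("U+{:04X}".format(cp))
```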
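Update for Python 3.3:
Now it works the way the OP expects. Python 3.3 dropped the narrow/wide build distinction (PEP 393), so iterating a string yields whole code points: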
>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
6
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
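If the code has to behave the same on both kinds of interpreter, one option is to iterate normally and join surrogate pairs by hand whenever they show up. The sketch below is my own illustration, not something from the standard library. The name iter_code_points is made up, and passing unpaired surrogates through unchanged is a design choice I'm assuming is acceptable.

```python
def iter_code_points(text):
    """Yield code points from text, joining any UTF-16 surrogate pairs.

    Hypothetical helper. Unpaired surrogates are yielded as-is.
    """
    pending_high = None
    for ch in text:
        code = ord(ch)
        if pending_high is not None:
            if 0xDC00 <= code <= 0xDFFF:
                # High + low surrogate: rebuild the original code point.
                yield 0x10000 + ((pending_high - 0xD800) << 10) + (code - 0xDC00)
                pending_high = None
                continue
            # The previous high surrogate had no partner.
            yield pending_high
            pending_high = None
        if 0xD800 <= code <= 0xDBFF:
            pending_high = code          # wait to see whether a low one follows
        else:
            yield code
    if pending_high is not None:
        yield pending_high               # trailing unpaired high surrogate


for cp in iter_code_points("abc\u20ac\U00010302\U0010fffd"):
    print("U+{:04X}".format(cp))
```

On 3.3+ the surrogate branch never fires for ordinary strings, so this costs little there. On a narrow build it gives the same six code points as the struct approach.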