Problem description
If I have a Python Unicode string that contains combining characters, len reports a value that does not correspond to the number of characters "seen".
For example, if I have a string with combining overlines and underlines such as u'A\u0332\u0305BC', len(u'A\u0332\u0305BC') reports 5, but the displayed string is only 3 characters long.
How do I get the "visible" length of a Unicode string containing combining glyphs in Python, that is, the number of distinct positions occupied by the string the user sees?
Recommended answer
The unicodedata module has a function combining that can be used to determine if a single character is a combining character. It returns the character's canonical combining class as an integer; if it returns 0, you can count the character as non-combining.
import unicodedata
len(u''.join(ch for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0))
Or, slightly simpler:
sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)
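As a hedged sketch, the same counting idea can be wrapped in a small helper. The name visible_length is hypothetical and not part of unicodedata. One caveat: unicodedata.combining reports 0 for some marks that still attach to the preceding character, such as enclosing marks like U+20DD, so this variant checks the general category (Mn for non-spacing marks, Me for enclosing marks) instead. That choice is an assumption about what should not occupy its own position; zero-width joiners, emoji sequences, and double-width East Asian characters are not handled here.

```python
import unicodedata

# Hypothetical helper: categories treated as "not occupying their own position".
# Mn = non-spacing mark, Me = enclosing mark (assumption, see the note above).
_ATTACHED_CATEGORIES = ("Mn", "Me")

def visible_length(text):
    """Count the characters in text that are not non-spacing or enclosing marks."""
    return sum(
        1 for ch in text
        if unicodedata.category(ch) not in _ATTACHED_CATEGORIES
    )

print(visible_length(u'A\u0332\u0305BC'))  # prints 3
```

For the string in the question, both this helper and the combining-based expressions give 3; the two approaches differ only for marks whose combining class is 0.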