Problem description
Is there a standard way, in Python, to normalize a unicode string so that it contains only the simplest unicode entities that can be used to represent it?
I mean something that would translate a sequence like ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT'] into ['LATIN SMALL LETTER A WITH ACUTE'].
Here is the problem:
>>> import unicodedata
>>> char = "á"
>>> len(char)
1
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A WITH ACUTE']
But now:
>>> char = "a\u0301"  # 'a' followed by U+0301 COMBINING ACUTE ACCENT
>>> len(char)
2
>>> [ unicodedata.name(c) for c in char ]
['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
I could, of course, iterate over all the characters and do manual replacements, but that is not efficient, and I'm pretty sure I would miss half of the special cases and make mistakes.
The unicodedata module offers a .normalize() function; you want to normalize to the NFC form. Here is an example using the same U+0061 LATIN SMALL LETTER A plus U+0301 COMBINING ACUTE ACCENT combination and the U+00E1 LATIN SMALL LETTER A WITH ACUTE code point you used:
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
>>> print(ascii(unicodedata.normalize('NFD', '\u00e1')))
'a\u0301'
(I used the ascii() function here to ensure non-ASCII codepoints are printed using escape syntax, making the differences clear.)
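A common reason to normalize is to compare strings that look identical but use different code points. The following is a minimal sketch of that idea; the helper name canonically_equal is made up for illustration and is not part of unicodedata.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

print(composed == decomposed)                    # False: different code points
print(canonically_equal(composed, decomposed))   # True: identical after NFC
```

Normalizing both sides to NFD would work just as well for the comparison; what matters is that both strings go through the same form.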
NFC, or 'Normal Form Composed', returns composed characters; NFD, 'Normal Form Decomposed', gives you decomposed characters, i.e. a base character followed by its combining marks.
The additional NFKC and NFKD forms deal with compatibility codepoints; e.g. U+2160 ROMAN NUMERAL ONE is really just the same thing as U+0049 LATIN CAPITAL LETTER I, but is present in the Unicode standard to remain compatible with encodings that treat them separately. Using either the NFKC or NFKD form, in addition to composing or decomposing characters, will also replace all 'compatibility' characters with their compatibility equivalents.
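If you want to see which kind of mapping a given character has, unicodedata.decomposition() returns the raw mapping from the Unicode database. The short sketch below prints it for the characters mentioned so far; a compatibility mapping carries a tag such as <compat>, while a canonical mapping has no tag.

```python
import unicodedata

for ch in ("\u00e1", "\u2160", "\u2167"):
    # decomposition() returns a string of hex code points, optionally
    # prefixed with a tag like "<compat>" for compatibility mappings.
    print(ascii(ch), unicodedata.name(ch), "->", unicodedata.decomposition(ch))

# '\xe1' LATIN SMALL LETTER A WITH ACUTE -> 0061 0301
# '\u2160' ROMAN NUMERAL ONE -> <compat> 0049
# '\u2167' ROMAN NUMERAL EIGHT -> <compat> 0056 0049 0049 0049
```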
Here is an example using the U+2167 ROMAN NUMERAL EIGHT codepoint; using the NFKC form replaces this with a sequence of ASCII V and I characters:
>>> unicodedata.normalize('NFC', '\u2167')
'Ⅷ'
>>> unicodedata.normalize('NFKC', '\u2167')
'VIII'
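To compare all four forms side by side, here is a small sketch that runs one sample string through each of them. The sample string is an arbitrary choice for illustration, combining a compatibility character with an accented letter.

```python
import unicodedata

# ROMAN NUMERAL EIGHT, a space, LATIN SMALL LETTER E WITH ACUTE
sample = "\u2167 \u00e9"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, sample)
    print(form, len(result), ascii(result))

# NFC 3 '\u2167 \xe9'
# NFD 4 '\u2167 e\u0301'
# NFKC 6 'VIII \xe9'
# NFKD 7 'VIII e\u0301'
```

The canonical forms (NFC, NFD) leave the roman numeral alone and only change how the accent is represented; the compatibility forms (NFKC, NFKD) additionally replace the numeral with plain letters.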
Note that there is no guarantee that composition and decomposition round-trip; decomposing a precomposed character to NFD form, then converting the result back to NFC form, does not always give you the original character back. The Unicode standard maintains a list of exceptions; characters on this list can be decomposed, but are not recomposed by NFC, for various reasons. Also see the documentation on the Composition Exclusion Table.
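The round-trip caveat can be checked directly. The sketch below uses a made-up helper, nfc_round_trips, and two characters that are not recomposed: U+0958 DEVANAGARI LETTER QA, which is on the composition exclusion list, and U+212B ANGSTROM SIGN, which has a singleton decomposition.

```python
import unicodedata


def nfc_round_trips(ch: str) -> bool:
    """Hypothetical helper: True if NFD followed by NFC gives ch back."""
    decomposed = unicodedata.normalize("NFD", ch)
    return unicodedata.normalize("NFC", decomposed) == ch


print(nfc_round_trips("\u00e1"))  # True:  a + acute recomposes to U+00E1
print(nfc_round_trips("\u0958"))  # False: stays as U+0915 U+093C
print(nfc_round_trips("\u212b"))  # False: comes back as U+00C5, not U+212B
```

In practice this means you should pick one form and normalize everything to it, rather than expecting to recover the exact original code points later.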