
Problem description

Can anyone please explain why NFD normalization of U+2126 (Ω) and U+03A9 (Ω) yields the same representation and does not preserve the code point? I would have expected this behaviour only for NFKD and NFKC (and for characters with diacritics).

import unicodedata

# U+2126 is OHM SIGN, U+03A9 is GREEK CAPITAL LETTER OMEGA
result1 = unicodedata.normalize("NFD", u"\u2126")
result2 = unicodedata.normalize("NFD", u"\u03A9")
print("NFD: " + repr(result1))
print("NFD: " + repr(result2))

Output:

NFD: u'\u03a9'
NFD: u'\u03a9'

Solution

These are known as "singleton decompositions", and exist for characters like U+2126 (Ω) that are present in Unicode for compatibility with existing standards. They are not "compatibility decompositions" (like U+1D6C0
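
The distinction can be checked directly against the Unicode Character Database that ships with Python. The following is a minimal sketch, assuming Python 3; the helper name `describe` and the three constants are made up for this illustration and are not part of the original answer. It prints the raw decomposition mapping of each character next to its four normal forms.

```python
import unicodedata

OHM = "\u2126"              # OHM SIGN
OMEGA = "\u03A9"            # GREEK CAPITAL LETTER OMEGA
BOLD_OMEGA = "\U0001D6C0"   # MATHEMATICAL BOLD CAPITAL OMEGA


def describe(ch):
    """Hypothetical helper: return the decomposition mapping of ch
    and its code points under all four normalization forms."""
    forms = {
        form: [f"U+{ord(c):04X}" for c in unicodedata.normalize(form, ch)]
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }
    return unicodedata.decomposition(ch), forms


for ch in (OHM, OMEGA, BOLD_OMEGA):
    mapping, forms = describe(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    print(f"  decomposition: {mapping!r}")
    for form, cps in forms.items():
        print(f"  {form}: {' '.join(cps)}")
```

The mapping for U+2126 carries no `<...>` tag, which marks it as canonical, so every form (including NFC and NFD) replaces it with U+03A9. The mapping for U+1D6C0 is tagged `<font>`, which marks it as a compatibility mapping, so only NFKC and NFKD replace it.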

