I have the task of counting the number of perceived characters in an input. The input is a group of ints (we can think of it as an int[]) which represents Unicode code points.
Using java.text.BreakIterator.getCharacterInstance() is not allowed. (Its result is exactly what I want, but weaving through its source code and state tables got me nowhere.)
I was wondering: what is the correct algorithm for counting the number of grapheme clusters given some code points?
Initially, I thought that all I had to do was combine all occurrences of:
- U+0300 – U+036F (combining diacritical marks)
- U+1DC0 – U+1DFF (combining diacritical marks supplement)
- U+20D0 – U+20FF (combining diacritical marks for symbols)
- U+FE20 – U+FE2F (combining half marks)
into the previous non-diacritic character.
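The naive merge described above can be sketched like this. Note that this only implements the first rough idea from the question (the four ranges listed), not the full grapheme-cluster algorithm:

```cpp
#include <cstddef>
#include <vector>

// Naive count: a code point in one of the four combining ranges listed
// above is folded into the preceding character. This is only the rough
// first idea from the question, not the full grapheme-cluster algorithm.
bool IsCombining(char32_t cp) {
  return (cp >= 0x0300 && cp <= 0x036F)   // combining diacritical marks
      || (cp >= 0x1DC0 && cp <= 0x1DFF)   // ... supplement
      || (cp >= 0x20D0 && cp <= 0x20FF)   // ... for symbols
      || (cp >= 0xFE20 && cp <= 0xFE2F);  // combining half marks
}

int NaiveCount(const std::vector<char32_t>& cps) {
  int n = 0;
  for (std::size_t i = 0; i < cps.size(); ++i) {
    // The first code point always starts a character; after that, only
    // non-combining code points start a new one.
    if (i == 0 || !IsCombining(cps[i])) ++n;
  }
  return n;
}
```

For example, NaiveCount({U'e', 0x0301}) — "e" plus a combining acute accent — counts as one perceived character.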
However, I realised that prior to that operation I first have to remove all noncharacters as well.
This includes:
- U+FDD0 – U+FDEF
- the last two code points of every plane
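A test for these two noncharacter cases can be written compactly: "the last two code points of every plane" are exactly the values whose low 16 bits are FFFE or FFFF, which allows a single mask check. A sketch:

```cpp
// Noncharacter test for the two cases above: U+FDD0..U+FDEF, plus the
// last two code points of every plane (low 16 bits are FFFE or FFFF).
bool IsNoncharacter(char32_t cp) {
  return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}
```

The mask works because masking off the lowest bit maps both U+xFFFE and U+xFFFF of any plane to a value ending in FFFE.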
But there seems to be more to do. Unicode.org states that we need to include U+200C (zero-width non-joiner) and U+200D (zero-width joiner) as part of the set of continuing characters (source).
Besides that, it talks about a couple more things, but the entire topic is treated in an abstract way. For example, what are the code point ranges for spacing combining marks, or for the Hangul Jamo characters that form Hangul syllables?
Does anyone know the correct algorithm for counting the number of grapheme clusters given an int[] of code points?
There's not a single canonical method appropriate to all uses, but a good starting point is the Unicode Grapheme Cluster Boundary algorithm on the Unicode.org page you link to. Basically, Unicode provides a database of each code point's grapheme break property, and then describes an algorithm to decide if a grapheme break is allowed between two code points based on their assigned grapheme break properties.
Here's part of an implementation (in C++) I played around with a while ago:
bool BoundaryAllowed(char32_t cp, char32_t cp2) {
  // lbp: left break property; rbp: right break property
  auto lbp = get_property_for_codepoint(cp),
       rbp = get_property_for_codepoint(cp2);
  // Do not break between a CR and LF. Otherwise, break before and after
  // controls.
  if (CR == lbp && LF == rbp) {
    // The Unicode grapheme boundary algorithm does not handle LFCR new lines
    return false;
  }
  if (Control == lbp || CR == lbp || LF == lbp || Control == rbp ||
      CR == rbp || LF == rbp) {
    return true;
  }
  // Do not break Hangul syllable sequences.
  if ((L == lbp && (L == rbp || V == rbp || LV == rbp || LVT == rbp)) ||
      ((LV == lbp || V == lbp) && (V == rbp || T == rbp)) ||
      ((LVT == lbp || T == lbp) && (T == rbp))) {
    return false;
  }
  // Do not break before extending characters.
  if (Extend == rbp) {
    return false;
  }
  // Do not break before SpacingMarks, or after Prepend characters.
  if (Prepend == lbp || SpacingMark == rbp) {
    return false;
  }
  return true;  // Otherwise, break everywhere.
}
In order to obtain the ranges for the different types of code points, you'll just have to look at the Unicode Character Database. The file with the grapheme break properties, which describes them in terms of ranges, is about 1200 lines long: http://www.unicode.org/Public/6.1.0/ucd/auxiliary/
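The relevant file in that directory is GraphemeBreakProperty.txt, whose data lines look like `0600..0605    ; Prepend # ...` or `00AD          ; Control # ...`, so building a range table is a matter of simple line parsing. A rough sketch (error handling and file I/O omitted):

```cpp
#include <optional>
#include <string>

// One parsed line of GraphemeBreakProperty.txt: a code point range and
// the grapheme break property assigned to it.
struct PropertyRange {
  char32_t start;
  char32_t end;
  std::string property;
};

// Parses lines like "0600..0605    ; Prepend # ..." or
// "00AD          ; Control # ...". Returns nullopt for blank or
// comment-only lines.
std::optional<PropertyRange> ParseLine(std::string line) {
  if (auto hash = line.find('#'); hash != std::string::npos)
    line.erase(hash);  // strip the trailing comment
  auto semi = line.find(';');
  if (semi == std::string::npos) return std::nullopt;
  std::string codes = line.substr(0, semi);
  std::string prop = line.substr(semi + 1);
  // Trim surrounding whitespace from the property name.
  prop.erase(0, prop.find_first_not_of(" \t"));
  prop.erase(prop.find_last_not_of(" \t") + 1);
  PropertyRange r;
  auto dots = codes.find("..");
  // std::stoul stops at the first non-hex character, so it reads the
  // start of the range even when ".." follows.
  r.start = static_cast<char32_t>(std::stoul(codes, nullptr, 16));
  r.end = dots == std::string::npos
              ? r.start
              : static_cast<char32_t>(
                    std::stoul(codes.substr(dots + 2), nullptr, 16));
  r.property = prop;
  return r;
}
```

From these ranges you can build a sorted table and binary-search it in get_property_for_codepoint.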
I'm not really sure how much value there is in ignoring noncharacter code points, but if your use requires it then you can add that to your implementation.