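I am trying to use the ICU Transliterator to perform a very specific transformation on some text.

My text contains both halfwidth Katakana characters and regular Latin characters. I want to convert the halfwidth Katakana to fullwidth Katakana while leaving the non-Katakana characters unchanged.

I thought I could simply apply the standard "Halfwidth-Fullwidth" ICU transform together with a filter that selects only Katakana, but that does not work: the Katakana filter does not apply to the Halfwidth Katakana Voiced Sound Mark (U+FF9E), which surprised me. I am trying to figure out whether this is intentional or a bug.

See the code below for examples. I have tried: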


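Halfwidth-Fullwidth with no filter: affects too much.
Halfwidth-Fullwidth with a Katakana filter: does not affect U+FF9E. Is this a bug?
Halfwidth-Fullwidth with a negated Latin filter: still affects the space.
Halfwidth-Fullwidth with a compound negated filter: too fragile.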


Any ideas?
Where can I check which characters are actually included in the ICU [:Katakana:] filter?

import com.ibm.icu.text.Transliterator;
import java.util.stream.Collectors;

// Helper not shown in the original post; assumed here to format each code point as U+XXXX.
String toCodepoints(String s) {
    return s.codePoints()
            .mapToObj(cp -> String.format("U+%04X", cp))
            .collect(Collectors.joining(" "));
}

void transliterateWithRules(String original, String rules) {
    Transliterator transliterator = Transliterator.createFromRules("mytest", rules, Transliterator.FORWARD);
    String result = transliterator.transliterate(original);
    System.out.println(String.format("Transliteration result: \"%s\", codepoints: %s", result, toCodepoints(result)));
}

void test() {
    // Halfwidth KI (U+FF77), halfwidth voiced sound mark (U+FF9E), space (U+0020), 'a' (U+0061)
    String input = "\uFF77\uFF9E a";

    transliterateWithRules(input, ":: Halfwidth-Fullwidth;");
    // Result:
    // "ギ　ａ", codepoints: U+30AE U+3000 U+FF41
    // This makes everything fullwidth, including the space and the Latin 'a'.

    transliterateWithRules(input, "::  [:Katakana:]; :: Halfwidth-Fullwidth;");
    // Result:
    // "キﾞ a", codepoints: U+30AD U+FF9E U+0020 U+0061
    // This makes the Katakana KI fullwidth and skips the space and 'a' as intended, but it also
    // skips the Halfwidth Katakana Voiced Sound Mark (U+FF9E), which I expected to be converted.

    transliterateWithRules(input, ":: [:^Latin:] Halfwidth-Fullwidth;");
    // Result:
    // "ギ　a", codepoints: U+30AE U+3000 U+0061
    // Skips the Latin 'a' as intended, but makes the space fullwidth, which I don't want.

    transliterateWithRules(input, ":: [[:^Latin:]&[^\\ ]]; :: Halfwidth-Fullwidth;");
    // Result:
    // "ギ a", codepoints: U+30AE U+0020 U+0061
    // Exactly what I want for this example, but relying on a list of exclusions is fragile,
    // since I am only excluding what I currently know about.
}

Best answer

If you list the characters in [:Katakana:], you will see that the set contains neither U+FF9E nor U+FF9F.

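This is because [:Katakana:] is equivalent to [:Script=Katakana:], which tests a character's "Script" property. U+FF9E and U+FF9F are both marked as being used in Hiragana as well as Katakana text, so their Script property is "Common" (unlike a character such as U+FF77, which is exclusively "Katakana"). There is a "Script Extensions" property that lists both scripts for these characters, but [:Katakana:] does not check it.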
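The Script values described above can be checked without ICU. The following is a minimal sketch that uses the JDK's `Character.UnicodeScript` and the script-based regex class `\p{IsKatakana}` instead of ICU's `UnicodeSet`; it assumes the JDK and ICU agree on the Script property for these code points. The class and method names are made up for illustration.

```java
import java.lang.Character.UnicodeScript;

public class ScriptPropertyCheck {
    // Returns the Unicode Script property of a code point, as reported by the JDK.
    static UnicodeScript scriptOf(int codePoint) {
        return UnicodeScript.of(codePoint);
    }

    // True if the JDK's script-based regex class for Katakana matches the code point.
    static boolean matchesKatakanaScript(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.matches("\\p{IsKatakana}");
    }

    public static void main(String[] args) {
        // Halfwidth KI, halfwidth voiced sound mark, halfwidth semi-voiced sound mark
        int[] samples = {0xFF77, 0xFF9E, 0xFF9F};
        for (int cp : samples) {
            System.out.printf("U+%04X script=%s katakana=%b%n",
                    cp, scriptOf(cp), matchesKatakanaScript(cp));
        }
    }
}
```

If the two sound marks report a script other than Katakana here, a purely script-based filter cannot be expected to select them.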

You can add them to the set manually ([[:Katakana:]\uFF9E\uFF9F]), or build a set that also includes the Script Extensions:

[\p{sc=Katakana}\p{scx=Katakana}]


(Note that this set also includes a few other shared characters.)
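With ICU, the manually extended set would presumably be used as the filter in the question's rule string, i.e. `:: [[:Katakana:]\uFF9E\uFF9F]; :: Halfwidth-Fullwidth;` (not verified here). To illustrate the same idea of "Katakana plus the two sound marks" without an ICU dependency, the sketch below swaps in a different technique: it uses a JDK regex to find the runs and applies NFKC normalization to them via `java.text.Normalizer`. NFKC is broader than a width-only conversion, so treat this as an approximation rather than a drop-in replacement. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KatakanaWidthSketch {
    // Katakana-script characters plus the two halfwidth sound marks (Script=Common).
    private static final Pattern KATAKANA_RUN =
            Pattern.compile("[\\p{IsKatakana}\\x{FF9E}\\x{FF9F}]+");

    // Applies NFKC only to the matched runs; all other text is copied unchanged.
    static String widenKatakana(String text) {
        Matcher m = KATAKANA_RUN.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            out.append(Normalizer.normalize(m.group(), Normalizer.Form.NFKC));
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Halfwidth KI + halfwidth voiced sound mark, then a space and a Latin 'a'
        String result = widenKatakana("\uFF77\uFF9E a");
        result.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

Because the sound mark is part of the same matched run as the preceding Katakana, the pair can be composed into a single voiced character, while the space and the Latin letter fall outside the run and are left alone.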
