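I am trying to match a search string against file names using a recursive directory search on Android. The problem is that the characters are Japanese, and in some cases they do not match. For example, the search string I want to match at the start of the file name is "呼ぶ". When I print the file name returned by file.getName(), it looks right: the file name printed to the console begins with "呼ぶ". But when I test it against the search string, e.g. fileName.startsWith("呼ぶ"), there is no match.
It turns out that when I print the substring of the file name being searched, the second character is different: the word is "呼ふ" rather than "呼ぶ". If I extract the bytes and print them in hex, the last byte is off by 1, presumably the difference between "ぶ" and "ふ".
Here is the code I used to show the difference: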
String name = soundFile.getName();
String string1 = question.kanji;
Log.d(TAG, "searching for : s1:" + question.kanji + " + " + question.hiragana + " + " + question.english);
Log.d(TAG, "name is: " + name);
Log.d(TAG, "question.kanaji.length(): " + question.kanji.length());
Log.d(TAG, "question.hiragana.length(): " + question.hiragana.length());
String compareStart = name.substring(0, string1.length() );
Log.d(TAG, "string1.length(): " + string1.length());
Log.d(TAG, "compareStart.length(): " + compareStart.length());
byte[] nameUTF8 = null;
byte[] s1UTF8 = null;
byte[] csUTF8 = null;
nameUTF8 = name.getBytes();
s1UTF8 = string1.getBytes();
csUTF8 = compareStart.getBytes();
Log.d(TAG, "nameUTF8.length: " + s1UTF8.length); // prints s1UTF8.length, not nameUTF8.length, hence the 6 in the output
Log.d(TAG, "s1UTF8.length: " + s1UTF8.length);
Log.d(TAG, "csUTF8.length: " + csUTF8.length);
for (int i = 0; i < s1UTF8.length; i++) {
Log.d(TAG, "s1UTF8[i]: " + Integer.toString(s1UTF8[i] & 0xff, 16).toUpperCase());
}
for (int i = 0; i < csUTF8.length; i++) {
Log.d(TAG, "csUTF8[i]: " + Integer.toString(csUTF8[i] & 0xff, 16).toUpperCase());
}
for (int i = 0; i < nameUTF8.length; i++) {
Log.d(TAG, "nameUTF8[i]: " + Integer.toString(nameUTF8[i] & 0xff, 16).toUpperCase());
}
Partial output:
D/AnswerView(12078): searching for : s1:呼ぶ + よぶ + to call out,to invite
D/AnswerView(12078): name is: 呼ぶ よぶ to call out,to invite.mp3
D/AnswerView(12078): question.kanaji.length(): 2
D/AnswerView(12078): question.hiragana.length(): 2
D/AnswerView(12078): string1: 呼ぶ
D/AnswerView(12078): compareStart: 呼ふ
D/AnswerView(12078): string1.length(): 2
D/AnswerView(12078): compareStart.length(): 2
D/AnswerView(12078): string1.length(): 2
D/AnswerView(12078): compareStart.length(): 2
D/AnswerView(12078): nameUTF8.length: 6
D/AnswerView(12078): s1UTF8.length: 6
D/AnswerView(12078): csUTF8.length: 6
D/AnswerView(12078): s1UTF8[i]: E5
D/AnswerView(12078): s1UTF8[i]: 91
D/AnswerView(12078): s1UTF8[i]: BC
D/AnswerView(12078): s1UTF8[i]: E3
D/AnswerView(12078): s1UTF8[i]: 81
D/AnswerView(12078): s1UTF8[i]: B6
D/AnswerView(12078): csUTF8[i]: E5
D/AnswerView(12078): csUTF8[i]: 91
D/AnswerView(12078): csUTF8[i]: BC
D/AnswerView(12078): csUTF8[i]: E3
D/AnswerView(12078): csUTF8[i]: 81
D/AnswerView(12078): csUTF8[i]: B5
D/AnswerView(12078): nameUTF8[i]: E5
D/AnswerView(12078): nameUTF8[i]: 91
D/AnswerView(12078): nameUTF8[i]: BC
D/AnswerView(12078): nameUTF8[i]: E3
D/AnswerView(12078): nameUTF8[i]: 81
D/AnswerView(12078): nameUTF8[i]: B5
D/AnswerView(12078): nameUTF8[i]: E3
D/AnswerView(12078): nameUTF8[i]: 82
D/AnswerView(12078): nameUTF8[i]: 99
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20
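This shows that the sixth byte of the extracted file name substring, and of the file name itself, is "B5" rather than the "B6" in the search string. Yet the printed file name displays correctly. I am puzzled: why does the file name display correctly on the console when the underlying characters are different? And why are there 3 additional non-null bytes near the start of the file name (E3 82 99), which the search string apparently does not need in order to represent the "ぶ" character?
Best answer
The problem appears to be one of normalization form. I know that on the Mac, for example, the filesystem is always in NFD, but the string you posted is in NFC. Watch: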
% cat /tmp/u
呼ぶ
% uwc /tmp/u
Paras Lines Words Graphs Chars Bytes File
0 1 1 3 3 7 /tmp/u
% uniquote -v /tmp/u
\N{CJK UNIFIED IDEOGRAPH-547C}\N{HIRAGANA LETTER BU}
% nfd /tmp/u | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-547C}\N{HIRAGANA LETTER HU}\N{COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK}
% nfc /tmp/u | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-547C}\N{HIRAGANA LETTER BU}
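The same decomposition can be reproduced in Java with java.text.Normalizer. This is a minimal sketch, not the asker's code: the class name NormalizationForms and the helper codePoints are made up for illustration. It prints the code points of "呼ぶ" in both forms, and the NFD form ends in U+3075 U+3099, which in UTF-8 is the E3 81 B5 E3 82 99 sequence seen in the question's log.

```java
import java.text.Normalizer;

class NormalizationForms {
    // Hypothetical helper: formats each code point of s as U+XXXX, space separated.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nfc = "\u547C\u3076"; // 呼ぶ with a precomposed HIRAGANA LETTER BU
        String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);

        System.out.println(codePoints(nfc)); // U+547C U+3076
        System.out.println(codePoints(nfd)); // U+547C U+3075 U+3099
    }
}
```

Both strings render identically on screen, which is why the console output in the question looks correct even though the underlying characters differ.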
So I think you will have to consider converting to NFD.
By the way, here is the U+547C CJK code point as it appears in the Unihan database:
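One way to apply this in Java is to bring both strings into the same normalization form before comparing. The sketch below is an illustration under assumptions: the class NormalizedPrefixMatch and the method startsWithNormalized are hypothetical names, and the sample file name is reconstructed from the question's log. It normalizes to NFC rather than NFD; either form works as long as both sides use the same one, but with NFD a bare "呼ふ" prefix would also match a decomposed "呼ぶ", because the combining mark follows the base letter.

```java
import java.text.Normalizer;

class NormalizedPrefixMatch {
    // Hypothetical helper: compares after putting both strings into NFC.
    static boolean startsWithNormalized(String name, String prefix) {
        String n = Normalizer.normalize(name, Normalizer.Form.NFC);
        String p = Normalizer.normalize(prefix, Normalizer.Form.NFC);
        return n.startsWith(p);
    }

    public static void main(String[] args) {
        // File name as a decomposing filesystem might return it (ふ + U+3099).
        String name = "\u547C\u3075\u3099 \u3088\u3075\u3099 to call out,to invite.mp3";
        String search = "\u547C\u3076"; // 呼ぶ typed as a precomposed character

        System.out.println(name.startsWith(search));             // false
        System.out.println(startsWithNormalized(name, search));  // true
    }
}
```

In the question's code, this would mean normalizing both soundFile.getName() and question.kanji before calling startsWith.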
呼 U+547C Lo Han CJK UNIFIED IDEOGRAPH-547C
Mandarin hu1 xu1
Cantonese fu1
JapaneseKun yobu
JapaneseOn ko
Korean ho
HanyuPinlu hu1(378) hu5(107)
Vietnamese hô