Problem description
I have some raw text that is usually a valid UTF-8 string. Every now and then, however, the input turns out to be a CESU-8 string instead. It is technically possible to detect this and convert to UTF-8, but because it happens so rarely, I would rather not spend a lot of CPU time on it.
Is there a fast way to detect whether a string is encoded as CESU-8 or UTF-8? I guess I could always blindly convert "UTF-8" to UTF-16LE and then back to UTF-8 using iconv(), and I would probably get the correct result every time because CESU-8 is close enough to UTF-8 for this to work. Can you suggest anything faster? (I expect the input to be CESU-8 rather than valid UTF-8 in roughly 0.01-0.1% of all strings.)
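(CESU-8 is a non-standard string format that contains 16-bit surrogate pairs encoded in UTF-8. Technically, a UTF-8 string should contain the characters represented by those surrogate pairs, not the surrogate pairs themselves.)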
Recommended answer
Here's a more efficient version of your conversion function:
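// Match a high surrogate (ED A0-AF xx) immediately followed by a low surrogate (ED B0-BF xx).
$regex = '@(\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF])@';
$s = preg_replace_callback($regex, function($m) {
    // unpack() returns a 1-based array holding the six matched bytes.
    $in = unpack("C*", $m[0]);
    $in[2] += 1; // Effectively adds 0x10000 to the codepoint.
    return pack("C*",
        0xF0 | (($in[2] & 0x1C) >> 2),
        0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2),
        0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),
        $in[6] // The last continuation byte is already in its final form.
    );
}, $s);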
The code only converts a high surrogate followed by a low surrogate, and it turns the two three-byte CESU-8 sequences directly into one four-byte UTF-8 sequence, i.e. from
ED A0-AF 80-BF ED B0-BF 80-BF
11101101 1010aaaa 10bbbbbb 11101101 1011cccc 10dddddd
to
F0-F4 80-BF 80-BF 80-BF
11110oaa 10aabbbb 10bbcccc 10dddddd // o is "overflow" bit
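To check the bit layout above against ordinary UTF-16 arithmetic, here is a minimal sketch that takes the longer route: it combines a surrogate pair into a code point and then encodes that code point as four UTF-8 bytes. The helper name surrogate_pair_to_utf8 is hypothetical and is not part of the answer; it only illustrates why incrementing the second byte is equivalent to adding 0x10000.

```php
<?php
// Hypothetical helper (not from the answer): combine a UTF-16 surrogate pair
// into a code point and encode that code point as four UTF-8 bytes.
function surrogate_pair_to_utf8(int $high, int $low): string
{
    // Standard UTF-16 decoding: 10 payload bits from each surrogate, plus 0x10000.
    $cp = 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);

    return pack("C*",
        0xF0 | ($cp >> 18),            // 11110xxx: top 3 bits
        0x80 | (($cp >> 12) & 0x3F),   // 10xxxxxx: next 6 bits
        0x80 | (($cp >> 6) & 0x3F),    // 10xxxxxx: next 6 bits
        0x80 | ($cp & 0x3F)            // 10xxxxxx: low 6 bits
    );
}

// U+1F600 is the surrogate pair D83D DE00, i.e. ED A0 BD ED B8 80 in CESU-8.
echo bin2hex(surrogate_pair_to_utf8(0xD83D, 0xDE00)), "\n"; // f09f9880
```

The answer's version produces the same four bytes without ever materialising the code point, which is why it gets by with a handful of masks and shifts.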
Here's an online example.
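Since the question expects CESU-8 in only a tiny fraction of inputs, one possible way to package the conversion is sketched below. The function name cesu8_to_utf8 and the strpos() pre-check are assumptions added for illustration, not part of the answer, and the pre-check has not been benchmarked. It relies on the fact that every encoded surrogate starts with the byte 0xED, so a string without that byte cannot contain a surrogate pair.

```php
<?php
// Hypothetical wrapper name; the regex and the bit manipulation follow the answer.
function cesu8_to_utf8(string $s): string
{
    // Assumed fast path: no 0xED byte means no encoded surrogates, so skip the regex.
    if (strpos($s, "\xED") === false) {
        return $s;
    }

    // High surrogate (ED A0-AF xx) immediately followed by low surrogate (ED B0-BF xx).
    $regex = '@\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]@';
    $out = preg_replace_callback($regex, function ($m) {
        $in = unpack("C*", $m[0]);   // 1-based array of the six matched bytes
        $in[2] += 1;                 // effectively adds 0x10000 to the code point
        return pack("C*",
            0xF0 | (($in[2] & 0x1C) >> 2),
            0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2),
            0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),
            $in[6]
        );
    }, $s);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $out ?? $s;
}

// "a", then U+1F600 in CESU-8, then "b".
echo bin2hex(cesu8_to_utf8("a\xED\xA0\xBD\xED\xB8\x80b")), "\n"; // 61f09f988062
```

Valid UTF-8 that happens to contain 0xED (for example Hangul syllables, whose second byte is in the range 80-9F) passes the pre-check but is left untouched by the regex. Unpaired surrogates are also left as they are, which matches the behaviour of the answer's code.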