Problem description
I'm having a problem with UTF-8 string comparison that I really can't figure out, and it's starting to give me a headache. Please help me out.
Basically, I have this string from an XML document encoded in UTF-8: 'Mina Tidigare anställningar'
When I compare that string with exactly the same string typed in myself, 'Mina Tidigare anställningar' (also in UTF-8), the result is FALSE!
I have no idea why. It is so strange. Can someone help me out?
Recommended answer
This seems somewhat relevant. To simplify, there are several ways to get the same text in Unicode (and therefore UTF-8): for example, ř can be written as one character, ř, or as two characters: r and the combining ˇ.
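To make the two spellings concrete, here is a minimal sketch in Python (not the asker's PHP, just an illustration using the standard unicodedata module). The variable names are made up for this example.

```python
import unicodedata

precomposed = "\u0159"   # ř stored as a single code point
decomposed = "r\u030c"   # r followed by a combining caron

# Both strings render as the same glyph, but they are different sequences.
print(len(precomposed), len(decomposed))    # 1 2
print(precomposed == decomposed)            # False

# The UTF-8 bytes differ as well.
print(precomposed.encode("utf-8").hex())    # c599
print(decomposed.encode("utf-8").hex())     # 72cc8c

# List the code points that make up the two-character spelling.
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```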
Your best bet would be the Normalizer class: normalize both strings to the same normalization form and compare the results.
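As a rough sketch of "normalize both, then compare", here is the idea in Python with the standard unicodedata module. The helper name same_text is hypothetical, and NFC is only one reasonable choice of form; what matters is that both sides use the same one. In PHP the corresponding call would go through the Normalizer class mentioned above.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after converting both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "anst\u00e4llningar"      # ä as one code point
decomposed = "ansta\u0308llningar"   # a followed by a combining diaeresis

print(composed == decomposed)            # False
print(same_text(composed, decomposed))   # True
```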
In one of the comments, you show these hex representations of the strings:
4d696e61205469646967617265 20   616e7374 c3a4   6c6c6e696e676172  // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172  // typed
                           ^^^^ 1        ^^^^^^ 2
Note the parts I marked; apparently there are two parts to this problem.
For the first, observe this question on the meaning of the byte sequence "c2a0": for some reason, your typing is translated to a non-breaking space (U+00A0) where the XML file has a normal space. Note that there's a normal space in both cases after "Mina". I'm not sure what to do about that in PHP, except to replace all whitespace with a normal space.
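The "replace all whitespace with a normal space" idea could look roughly like this in Python (again only an illustration, not PHP). The function name normalize_spaces is made up; the sketch relies on Python's \s matching Unicode whitespace, including U+00A0, when the pattern is a str.

```python
import re

def normalize_spaces(text: str) -> str:
    """Replace every Unicode whitespace character (including U+00A0) with a plain space."""
    return re.sub(r"\s", " ", text)

typed = "Mina Tidigare\u00a0anst"   # non-breaking space before "anst"

print(typed == "Mina Tidigare anst")                     # False
print(normalize_spaces(typed) == "Mina Tidigare anst")   # True
```

Note that this also turns tabs and newlines into spaces, which may or may not be what you want.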
As to the second, that is the case I outlined above: c3a4 is ä (U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS", one character, two bytes), whereas 61 is a (U+0061 "LATIN SMALL LETTER A", one character, one byte) and cc88 is the combining umlaut (U+0308 "COMBINING DIAERESIS"); together that makes two characters, three bytes. Here, the normalization library should be useful.
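Putting both parts together on the exact bytes from the hex dumps, a minimal Python sketch might look like this. The helper canonical is hypothetical, and the choice of NFC plus one-for-one whitespace replacement is an assumption about what "equal" should mean for this comparison.

```python
import re
import unicodedata

# The two byte sequences from the hex dumps, decoded as UTF-8.
from_xml = bytes.fromhex(
    "4d696e61205469646967617265" "20" "616e7374" "c3a4" "6c6c6e696e676172"
).decode("utf-8")
typed = bytes.fromhex(
    "4d696e61205469646967617265" "c2a0" "616e7374" "61cc88" "6c6c6e696e676172"
).decode("utf-8")

def canonical(text: str) -> str:
    """Map all whitespace to a plain space, then compose characters (NFC)."""
    text = re.sub(r"\s", " ", text)
    return unicodedata.normalize("NFC", text)

print(from_xml == typed)                          # False
print(canonical(from_xml) == canonical(typed))    # True
```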