Strange UTF-8 string comparison

Question

I'm having a problem with UTF-8 string comparison that I really can't figure out, and it's starting to give me a headache. Please help me out.
Basically, I have this string from an XML document encoded in UTF-8: 'Mina Tidigare anställningar'
When I compare it with exactly the same string, which I typed myself: 'Mina Tidigare anställningar' (also in UTF-8), the result is FALSE!
I have no idea why. It is so strange. Can someone help me out?

Recommended answer

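To simplify, there are several ways to write the same text in Unicode (and therefore in UTF-8). For example, ř can be written as one character, ř (U+0159), or as two characters: r followed by the combining caron ˇ (U+030C). The two forms look identical on screen but are different byte sequences, so a byte-wise comparison reports them as unequal.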

Your best bet would be the Normalizer class (part of PHP's intl extension): normalize both strings to the same normalization form and compare the results.
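The idea of normalizing both sides before comparing can be sketched in a few lines. This is only an illustration in Python, using the standard `unicodedata` module; the helper name `same_text` is made up for the example. In PHP the analogous call should be `Normalizer::normalize()` from the intl extension.

```python
import unicodedata

precomposed = "\u0159"   # ř as a single code point
decomposed = "r\u030c"   # r followed by COMBINING CARON


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(precomposed == decomposed)            # False: different code points
print(same_text(precomposed, decomposed))   # True: equal once normalized
```

Which form you pick (NFC or NFD) should not matter for the comparison, as long as both strings go through the same one.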

In one of the comments, you show these hex representations of the strings:

4d696e61205469646967617265 20   616e7374 c3a4   6c6c6e696e676172  // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
        ^^-----------------^^^^1         ^^^^^^2
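If you want to reproduce this kind of dump yourself, something along these lines might help. It is a rough Python sketch, not what was used to produce the listing above: it rebuilds both strings from the hex shown and lists every non-ASCII character with its position, code point, and UTF-8 bytes. The helper name `non_ascii` is hypothetical.

```python
# Rebuild the two strings from the hex dumps shown above.
from_xml = bytes.fromhex(
    "4d696e61205469646967617265" "20" "616e7374" "c3a4" "6c6c6e696e676172"
).decode("utf-8")
typed = bytes.fromhex(
    "4d696e61205469646967617265" "c2a0" "616e7374" "61cc88" "6c6c6e696e676172"
).decode("utf-8")


def non_ascii(s: str) -> list:
    """List (index, code point, UTF-8 bytes as hex) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", ch.encode("utf-8").hex())
        for i, ch in enumerate(s)
        if ord(ch) > 0x7F
    ]


print(non_ascii(from_xml))  # [(18, 'U+00E4', 'c3a4')]
print(non_ascii(typed))     # [(13, 'U+00A0', 'c2a0'), (19, 'U+0308', 'cc88')]
```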

Note the parts I marked; there appear to be two parts to this problem.

  • For the first, the byte sequence "c2a0" is the UTF-8 encoding of U+00A0 "NO-BREAK SPACE": for some reason, your typed text has a non-breaking space where the XML file has a normal space ("20"). Note that there's a normal space in both cases after "Mina". Not sure what to do about that in PHP, except to replace all whitespace with a normal space.

  • As to the second, that is the case I outlined above: c3a4 is ä (U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS", one character, two bytes), whereas 61 is a (U+0061 "LATIN SMALL LETTER A", one character, one byte) and cc88 is the combining umlaut (U+0308 "COMBINING DIAERESIS", one character, two bytes). The typed ä is therefore two characters in three bytes. Here, the normalization library should be useful.
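Putting the two fixes together, a minimal sketch could look like the following. It is Python again, purely for illustration; the function name `canonical` is made up, and replacing every whitespace character with a plain space is an assumption that may be too aggressive if tabs or line breaks matter in your data.

```python
import re
import unicodedata

from_xml = "Mina Tidigare anst\u00e4llningar"        # normal space, precomposed ä
typed = "Mina Tidigare\u00a0ansta\u0308llningar"     # no-break space, a + U+0308


def canonical(s: str) -> str:
    """Normalize to NFC, then turn every whitespace character into a plain space."""
    s = unicodedata.normalize("NFC", s)
    return re.sub(r"\s", " ", s)  # \s also matches U+00A0 for str patterns


print(from_xml == typed)                        # False
print(canonical(from_xml) == canonical(typed))  # True
```

The same two steps should carry over to PHP: normalize with the Normalizer class, then replace the non-breaking space (or all whitespace) before comparing.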
