Problem description
I've been reading about Unicode and UTF-8 for the last couple of days, and I often come across a bitwise comparison similar to this:
int strlen_utf8(char *s)
{
    int i = 0, j = 0;
    while (s[i])
    {
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}
Can someone clarify the comparison with 0xc0, and whether it is checking the most significant bit?
Thank you!
EDIT: ANDed, not comparison, used the wrong word ;)
It's not a comparison with 0xc0, it's a bitwise AND operation with 0xc0.
The bit mask 0xc0 is 11 00 00 00, so what the AND is doing is extracting only the top two bits:

        ab cd ef gh
    AND 11 00 00 00
        -- -- -- --
      = ab 00 00 00
This is then compared to 0x80 (binary 10 00 00 00). In other words, the if statement is checking to see if the top two bits of the value are not equal to 10.
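The mask-and-compare step can be pulled out into a tiny helper to make it easier to see on its own. This is only a sketch: the name is_continuation_byte is made up for illustration and does not appear in the original post.

```c
#include <stdbool.h>

/* Hypothetical helper (name chosen for this sketch only).
 * Keeps the top two bits of the byte with the 0xC0 mask and
 * checks whether they are exactly "10", i.e. 0x80. */
static bool is_continuation_byte(unsigned char b)
{
    return (b & 0xC0u) == 0x80u;
}
```

For instance, 0xA9 is 10 10 10 01 in binary, so the AND leaves 10 00 00 00 (0x80) and the helper returns true. 0xC3 is 11 00 00 11, the AND leaves 11 00 00 00 (0xC0), and the helper returns false.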
"Why?", I hear you ask. Well, that's a good question. The answer is that, in UTF-8, all bytes that begin with the bit pattern 10 are subsequent bytes of a multi-byte sequence:
                       UTF-8
    Range              Encoding  Binary value
    -----------------  --------  --------------------------
    U+000000-U+00007f  0xxxxxxx  0xxxxxxx

    U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                       10xxxxxx

    U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                       10yyyyxx
                       10xxxxxx

    U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                       10zzyyyy
                       10yyyyxx
                       10xxxxxx
So, what the little strlen_utf8 snippet is doing is going through every byte of your UTF-8 string and counting all the bytes that aren't continuation bytes. Every encoded character (code point) contains exactly one non-continuation byte, so the count is the length of the string in code points rather than in bytes, as advertised. See this Wikipedia link for more detail and Joel Spolsky's excellent article for a primer.
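To make the difference between bytes and code points concrete, here is a lightly tidied variant of the counting loop. It is a sketch, not a replacement for the original: the name count_code_points is invented here, and it assumes the input is NUL-terminated and already valid UTF-8 (malformed input is not detected).

```c
#include <stddef.h>

/* Hypothetical variant of the question's strlen_utf8.
 * Assumes s is a NUL-terminated, valid UTF-8 string. */
static size_t count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        /* Look at the raw byte value, then count every byte
         * whose top two bits are not "10". */
        if (((unsigned char)*s & 0xC0u) != 0x80u)
            count++;
    }
    return count;
}
```

With the string "h\xC3\xA9llo" (the word "héllo", where é is the two bytes C3 A9), strlen reports 6 bytes while count_code_points reports 5. The euro sign "\xE2\x82\xAC" is 3 bytes but counts as 1.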
An interesting aside, by the way. You can classify bytes in a UTF-8 stream as follows:
- With the high bit set to 0, it's a single-byte value.
- With the two high bits set to 10, it's a continuation byte.
- Otherwise, it's the first byte of a multi-byte sequence, and the number of leading 1 bits indicates how many bytes there are in total for this sequence (110... means two bytes, 1110... means three bytes, etc.).
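The classification above maps fairly directly onto a chain of mask tests. The following is a minimal sketch under a few assumptions: the function name and the return convention (0 for a continuation byte, -1 for a byte with five or more leading 1 bits) are made up for this example, and it only inspects the bit pattern, so it does not reject lead bytes such as 0xC0 or 0xC1 that a strict UTF-8 validator would refuse.

```c
/* Hypothetical classifier (name and return codes are illustrative).
 * Returns:
 *   1..4  total length of the sequence this byte starts
 *   0     the byte is a continuation byte (10xxxxxx)
 *  -1     five or more leading 1 bits, which the table above never produces */
static int utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80u) == 0x00u) return 1;  /* 0xxxxxxx: single byte  */
    if ((b & 0xC0u) == 0x80u) return 0;  /* 10xxxxxx: continuation */
    if ((b & 0xE0u) == 0xC0u) return 2;  /* 110xxxxx: two bytes    */
    if ((b & 0xF0u) == 0xE0u) return 3;  /* 1110xxxx: three bytes  */
    if ((b & 0xF8u) == 0xF0u) return 4;  /* 11110xxx: four bytes   */
    return -1;                           /* 11111xxx               */
}
```

Each mask keeps one more bit than the previous one, so the test for a given lead pattern checks both its run of 1 bits and the 0 bit that ends the run.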