Problem description
I have a problem when trying to convert bytes to String in Java, with code like:
byte[] bytes = {1, 2, -3};
byte[] transferred = new String(bytes, Charsets.UTF_8).getBytes(Charsets.UTF_8);
The original bytes and the transferred bytes are not the same. They are, respectively:
[1, 2, -3]
[1, 2, -17, -65, -67]
At first I thought this was due to how the UTF-8 charset maps the negative value -3, so I changed it to -32. But the transferred array remains the same!
[1, 2, -32]
[1, 2, -17, -65, -67]
So I would really like to know exactly what happens when I call new String(bytes) :)
Not all sequences of bytes are valid in UTF-8.
UTF-8 is a smart scheme with a variable number of bytes per code point, the form of every byte indicating how many other bytes follow for the same code point.
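Refer to this table of UTF-8 byte patterns:
Sequence length   First byte   Following bytes
1 byte            0xxxxxxx     (none)
2 bytes           110xxxxx     10xxxxxx
3 bytes           1110xxxx     10xxxxxx 10xxxxxx
4 bytes           11110xxx     10xxxxxx 10xxxxxx 10xxxxxx
The original UTF-8 design also defined 5-byte (111110xx) and 6-byte (1111110x) start bytes; RFC 3629 removed them, so they are illegal in current UTF-8.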
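The start-byte rule described above can be sketched in a few lines of Java. This is only an illustration: the class name Utf8LeadByte and the method announcedLength are made up for this example, and the check looks at the bit pattern alone (it does not reject overlong or out-of-range start bytes such as 0xC0 or 0xF5).

```java
public class Utf8LeadByte {
    /**
     * Returns the sequence length announced by a UTF-8 start byte,
     * 0 for a continuation byte (10xxxxxx), or -1 for a byte that
     * cannot start a sequence in current UTF-8 (11111xxx).
     * Hypothetical helper, based on the bit pattern only.
     */
    static int announcedLength(byte b) {
        int u = b & 0xFF; // Java bytes are signed; view the value as 0..255
        if ((u & 0x80) == 0x00) return 1; // 0xxxxxxx: stands alone
        if ((u & 0xC0) == 0x80) return 0; // 10xxxxxx: continuation byte
        if ((u & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((u & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((u & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 11111xxx: old 5/6-byte forms, now illegal
    }

    public static void main(String[] args) {
        byte[] samples = {1, 2, -3, -32};
        for (byte b : samples) {
            System.out.println(b + " -> " + announcedLength(b));
        }
        // prints:
        // 1 -> 1
        // 2 -> 1
        // -3 -> -1
        // -32 -> 3
    }
}
```

Note that -32 announces a 3-byte sequence, so on its own at the end of an array it is just as incomplete as -3 is illegal.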
Now let's see how it applies to your {1, 2, -3}:
Bytes 1 (hex 0x01, binary 00000001) and 2 (hex 0x02, binary 00000010) stand alone, no problem.
Byte -3 (hex 0xFD, binary 11111101) is the start byte of a 6-byte sequence (which is actually illegal in the current UTF-8 standard), but your byte array does not have such a sequence.
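To check the hex and binary forms quoted above for yourself, a signed Java byte can be masked with 0xFF before formatting. The class ByteViews and its two helpers are hypothetical names used only for this sketch.

```java
public class ByteViews {
    /** Formats a signed Java byte as two hex digits, e.g. -3 becomes "0xFD". */
    static String hex(byte b) {
        return String.format("0x%02X", b & 0xFF);
    }

    /** Formats a signed Java byte as eight binary digits, left-padded with zeros. */
    static String binary(byte b) {
        String bits = Integer.toBinaryString(b & 0xFF);
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        for (byte b : bytes) {
            System.out.println(b + " = " + hex(b) + " = " + binary(b));
        }
        // prints:
        // 1 = 0x01 = 00000001
        // 2 = 0x02 = 00000010
        // -3 = 0xFD = 11111101
    }
}
```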
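Your UTF-8 is invalid. The Java UTF-8 decoder replaces this invalid byte -3 with Unicode code point U+FFFD REPLACEMENT CHARACTER. In UTF-8, code point U+FFFD is hex 0xEF 0xBF 0xBD (binary 11101111 10111111 10111101), represented in Java as -17, -65, -67.
The same thing happens with -32 (hex 0xE0, binary 11100000): it is the start byte of a 3-byte sequence, but no continuation bytes follow it, so the decoder replaces it with U+FFFD as well. That is why both arrays come back as [1, 2, -17, -65, -67].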
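Here is a minimal sketch that reproduces the round trip and also shows one way to detect invalid input instead of silently replacing it. It uses the JDK's StandardCharsets.UTF_8 in place of the Charsets.UTF_8 constant from the question; the class name Utf8RoundTrip and the helpers roundTrip and isValidUtf8 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    /** Decodes as UTF-8 and encodes again; malformed input comes back as U+FFFD. */
    static byte[] roundTrip(byte[] input) {
        return new String(input, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    /** Returns true only if the bytes decode as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] input) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of replacing
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        System.out.println(Arrays.toString(roundTrip(bytes))); // [1, 2, -17, -65, -67]
        System.out.println(isValidUtf8(bytes));                // false
    }
}
```

If the goal is to carry arbitrary bytes through a String without loss, UTF-8 is the wrong tool; a byte-preserving single-byte charset such as ISO-8859-1, or an encoding like Base64, is a more common choice.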