



I have a problem when trying to convert bytes to String in Java, with code like:

byte[] bytes = {1, 2, -3};

byte[] transferred = new String(bytes, Charsets.UTF_8).getBytes(Charsets.UTF_8);

and the original bytes are not the same as the transferred bytes, which are respectively

[1, 2, -3]
[1, 2, -17, -65, -67]

At first I thought this was due to how the UTF-8 charset maps the negative value "-3", so I changed it to "-32". But the transferred array stays the same!

[1, 2, -32]
[1, 2, -17, -65, -67]

So I would really like to know exactly what happens when I call new String(bytes) :)


Not all sequences of bytes are valid in UTF-8.

UTF-8 is a smart scheme with a variable number of bytes per code point: the form of the first byte of each code point indicates how many more bytes follow for that same code point, and each of those following bytes has its own recognizable form.
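To make the "variable number of bytes" point concrete, here is a small sketch that encodes a few sample code points and prints how many bytes each one takes. The class name, the helper encodedLength, and the sample code points are my own choices for illustration and are not from the question.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes UTF-8 needs to encode one code point.
    static int encodedLength(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Sample code points (chosen for illustration): A, e-acute, euro sign, an emoji.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, encodedLength(cp));
        }
    }
}
```

On a standard JDK this should report 1, 2, 3 and 4 bytes respectively.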

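Refer to this table of UTF-8 byte patterns (shown here in the original design, before sequences were limited to 4 bytes):

0xxxxxxx   single-byte code point (ASCII)
10xxxxxx   continuation byte
110xxxxx   start of a 2-byte sequence
1110xxxx   start of a 3-byte sequence
11110xxx   start of a 4-byte sequence
111110xx   start of a 5-byte sequence (no longer legal)
1111110x   start of a 6-byte sequence (no longer legal)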

Now let's see how it applies to your {1, 2, -3}:

Bytes 1 (hex 0x01, binary 00000001) and 2 (hex 0x02, binary 00000010) stand alone, no problem.

Byte -3 (hex 0xFD, binary 11111101) is the start byte of a 6-byte sequence (which is actually illegal in the current UTF-8 standard), but your byte array does not have such a sequence.
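The same check can be done in code. The sketch below classifies a byte purely by its high bits; the class name and the describe helper are hypothetical names of mine, and it only looks at bit patterns, so it is not a full validator (a strict decoder also rejects some values that match these patterns, such as 0xC0 and 0xC1). It also covers the -32 from the question: that byte is 0xE0, the start of a 3-byte sequence, and since no continuation bytes follow it in the array, it is invalid input as well.

```java
class Utf8ByteKind {
    // Describe the role of a byte in UTF-8 based only on its leading bits.
    static String describe(byte b) {
        int u = b & 0xFF; // Java bytes are signed, so view the value as 0..255
        if ((u & 0x80) == 0x00) return "single byte (ASCII)";
        if ((u & 0xC0) == 0x80) return "continuation byte";
        if ((u & 0xE0) == 0xC0) return "start of 2-byte sequence";
        if ((u & 0xF0) == 0xE0) return "start of 3-byte sequence";
        if ((u & 0xF8) == 0xF0) return "start of 4-byte sequence";
        if ((u & 0xFC) == 0xF8) return "start of 5-byte sequence (illegal today)";
        if ((u & 0xFE) == 0xFC) return "start of 6-byte sequence (illegal today)";
        return "never valid in UTF-8"; // 0xFE and 0xFF
    }

    public static void main(String[] args) {
        byte[] samples = {1, 2, -3, -32};
        for (byte b : samples) {
            int u = b & 0xFF;
            // Left-pad the binary form to 8 digits.
            String bits = String.format("%8s", Integer.toBinaryString(u)).replace(' ', '0');
            System.out.printf("%4d = 0x%02X = %s -> %s%n", b, u, bits, describe(b));
        }
    }
}
```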

Your UTF-8 is invalid. The Java UTF-8 decoder replaces this invalid byte -3 with the Unicode code point U+FFFD REPLACEMENT CHARACTER. In UTF-8, code point U+FFFD is hex 0xEF 0xBF 0xBD (binary 11101111 10111111 10111101), which Java's signed bytes represent as -17, -65, -67.
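If you want to see the replacement happen, or detect invalid input instead of silently getting U+FFFD, something like the following sketch may help. It uses StandardCharsets.UTF_8 from the JDK in place of the Charsets.UTF_8 in the question; the class and method names are mine. The strict variant relies on CharsetDecoder with CodingErrorAction.REPORT, which throws on malformed input rather than substituting.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8RoundTrip {
    // Lenient round trip: new String(...) replaces malformed input with U+FFFD.
    static byte[] roundTrip(byte[] in) {
        return new String(in, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    // Strict check: false if the bytes are not well-formed UTF-8.
    static boolean isValidUtf8(byte[] in) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(in));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        System.out.println(Arrays.toString(roundTrip(bytes))); // [1, 2, -17, -65, -67]
        System.out.println(isValidUtf8(bytes));                // false
        System.out.println(isValidUtf8(new byte[] {1, 2, 3})); // true
    }
}
```

Checking with the strict decoder first lets you decide what to do with bad input yourself, rather than discovering later that the bytes changed.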

