java - Java: Interpreting UTF-8 in a Java program

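My program receives an array of integers from a browser application, and the array should be interpreted as UTF-8 (see the code sample). I can echo the resulting string ("theString" in the code below) back to the browser and everything works. Inside the Java program, however, the result is wrong. The input string is "Hällo", but the Java program prints it as "Hõllo".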

import java.io.*;
import java.nio.charset.*;

public class TestCode {
   public static void main (String[] args) throws IOException {

      // H : 72
      // ä : 195 164
      // l : 108
      // o : 111
      // the following is the input sent from browser representing String = "Hällo"
      int[] utf8Array = {72, 195, 164, 108, 108, 111};

      String notYet = new String(utf8Array, 0, utf8Array.length);
      String theString = new String(notYet.getBytes(), Charset.forName("UTF-8"));

      System.out.println(theString);
   }
}

Best answer

This will do the trick:

int[] utf8Array = {72, 195, 164, 108, 108, 111};
byte[] bytes = new byte[utf8Array.length];
for (int i = 0; i < utf8Array.length; ++i) {
    bytes[i] = (byte) utf8Array[i];
}
String theString = new String(bytes, Charset.forName("UTF-8"));

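The problem with passing the int[] directly is that the String class interprets each int as a separate character (a Unicode code point), so 195 and 164 become two characters. After conversion to byte[], String treats the input as raw bytes and decodes 195, 164 as a single character encoded in two bytes rather than as two characters.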
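To make the difference concrete, here is a minimal sketch that feeds the same six values through both paths. The class name `CodePointsVsBytes` and its two helper methods are hypothetical names chosen for this illustration, and `StandardCharsets.UTF_8` is used in place of `Charset.forName("UTF-8")` purely for brevity. The sketch prints string lengths rather than the strings themselves, so the result does not depend on the console encoding.

```java
import java.nio.charset.StandardCharsets;

public class CodePointsVsBytes {
    // Treats every int as a Unicode code point,
    // which is what the String(int[], int, int) constructor does.
    static String fromCodePoints(int[] values) {
        return new String(values, 0, values.length);
    }

    // Narrows every int to a byte, then decodes the byte sequence as UTF-8.
    static String fromUtf8Bytes(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] utf8Array = {72, 195, 164, 108, 108, 111};
        // 195 and 164 become two separate characters (U+00C3 and U+00A4).
        System.out.println(fromCodePoints(utf8Array).length()); // 6
        // 195, 164 are decoded together as one character (U+00E4, "ä").
        System.out.println(fromUtf8Bytes(utf8Array).length());  // 5
    }
}
```

Note that the question's code also calls `getBytes()` without an argument, which encodes with the platform default charset, so its result can vary from machine to machine.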

Update: in reply to your comment, unfortunately Java is just this verbose. Compare with Scala:

val ints = Array(72, 195, 164, 108, 108, 111)
println(new String(ints map (_.toByte), "UTF-8"))

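Also, the distinction between int and byte is not just the compiler being picky; when UTF-8 encoding is involved, the two really do mean different things.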
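As a small illustration of that distinction, the sketch below encodes "ä" to UTF-8 and prints each byte next to its unsigned value. Java's `byte` is signed, so the browser's 195 and 164 are stored as -61 and -92; masking with `0xFF` recovers the original numbers. The class name `IntVersusByte` and the helper `toUnsigned` are hypothetical names used only for this example.

```java
import java.nio.charset.StandardCharsets;

public class IntVersusByte {
    // Recovers the 0..255 value of a signed Java byte.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        // One character (code point U+00E4) becomes two bytes in UTF-8.
        byte[] encoded = "\u00E4".getBytes(StandardCharsets.UTF_8);
        for (byte b : encoded) {
            System.out.println(b + " -> " + toUnsigned(b));
        }
        // prints:
        // -61 -> 195
        // -92 -> 164
    }
}
```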