java - JAXB and UTF-8 decoding exception "Invalid byte 2 of 2-byte UTF-8 sequence"

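I have read a few SO answers saying that JAXB has a bug, which it blames on the nature of XML, that keeps it from working with UTF-8. My question is: what is the workaround? My users may copy and paste Unicode characters into data fields, and I need to preserve, marshal, unmarshal, and redisplay that data elsewhere.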

(Update)
More context:

    Candidate c = new Candidate();
    c.addSubstitution("3 4ths", "\u00BE");
    c.addSubstitution("n with tilde", "\u00F1");
    c.addSubstitution("schwa", "\u018F");
    c.addSubstitution("Sigma", "\u03A3");
    c.addSubstitution("Cyrillic Th", "\u040B");
    jc = JAXBContext.newInstance(Candidate.class);
    Marshaller marshaller = jc.createMarshaller();
    marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
    marshaller.setProperty(Marshaller.JAXB_ENCODING, "UTF-8");
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    marshaller.marshal(c, os);
    String xml = os.toString();
    System.out.println(xml);
    jc = JAXBContext.newInstance(Candidate.class);
    Unmarshaller jaxb = jc.createUnmarshaller();
    ByteArrayInputStream is = new ByteArrayInputStream(xml.getBytes());
    Candidate newCandidate = (Candidate) jaxb.unmarshal(is);
    for(Substitution s:c.getSubstitutions()) {
        System.out.println(s.getSubstitutionName() + "='" + s.getSubstitutionValue() + "'");
    }

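This is a small test I put together. The exact characters I receive are not entirely under my control. A user might paste an N with a tilde into a field, or anything else.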

Best answer

This is the problem in your test code:

ByteArrayInputStream is = new ByteArrayInputStream(xml.getBytes());

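You are converting the string to a byte array using the platform default encoding. Don't do that. You have specified UTF-8 as the encoding, so you must use it when you create the byte array as well: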

ByteArrayInputStream is = new ByteArrayInputStream(xml.getBytes("UTF-8"));
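
To see why the platform default matters, the following minimal sketch decodes UTF-8 bytes and re-encodes them with a given charset, mimicking os.toString() followed by xml.getBytes(). The class name CharsetRoundTrip and the choice of US-ASCII as a stand-in for a non-UTF-8 platform default are assumptions made for illustration; the real default depends on the machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the original test code.
class CharsetRoundTrip {

    // Decode the bytes to a String and encode them again with the same charset,
    // which is what os.toString() followed by xml.getBytes() does implicitly.
    static byte[] roundTrip(byte[] bytes, Charset charset) {
        String text = new String(bytes, charset);
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // XML fragment containing "n with tilde", encoded as UTF-8 like the marshaller output.
        byte[] utf8 = "<v>\u00F1</v>".getBytes(StandardCharsets.UTF_8);

        boolean intactWithUtf8 =
                Arrays.equals(utf8, roundTrip(utf8, StandardCharsets.UTF_8));
        // US-ASCII stands in for a platform default that is not UTF-8 (assumption).
        boolean intactWithAscii =
                Arrays.equals(utf8, roundTrip(utf8, StandardCharsets.US_ASCII));

        System.out.println("Round trip through UTF-8 keeps the bytes: " + intactWithUtf8);
        System.out.println("Round trip through US-ASCII keeps the bytes: " + intactWithAscii);
    }
}
```

When the charset cannot represent the bytes, the round trip silently replaces them, and the unmarshaller then receives input that is no longer the UTF-8 the marshaller wrote.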

Likewise, don't use ByteArrayOutputStream.toString(), which again uses the platform default encoding. In fact, you don't need to convert the output to a string at all:

ByteArrayOutputStream os = new ByteArrayOutputStream();
marshaller.marshal(c, os);
byte[] xml = os.toByteArray();
jc = JAXBContext.newInstance(Candidate.class);
Unmarshaller jaxb = jc.createUnmarshaller();
ByteArrayInputStream is = new ByteArrayInputStream(xml);
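
As an illustration of keeping the data as bytes from end to end, this sketch uses the JDK's built-in DOM parser rather than JAXB. That substitution is an assumption made so the example runs without extra dependencies, and ParseBytesDirectly and parseRootText are hypothetical names. The parser works out the encoding from the byte stream and the XML declaration, so no String conversion is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper; the JDK DOM parser stands in for the JAXB unmarshaller.
class ParseBytesDirectly {

    // Parse the raw bytes and return the text content of the root element.
    static String parseRootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Sigma and Cyrillic Tshe, encoded explicitly as UTF-8.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><value>\u03A3\u040B</value>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("Characters preserved: "
                + parseRootText(xml).equals("\u03A3\u040B"));
    }
}
```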

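This should work fine with the characters you are using. You will still have problems with characters that XML 1.0 cannot represent (those below U+0020, other than \r, \n and \t), but that's all.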
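
If pasted input may contain such characters, one possible approach (not something the answer prescribes) is to drop them before marshalling. The sketch below filters by the Char production of the XML 1.0 specification; Xml10Sanitizer, isXml10Char and strip are hypothetical names, and silently removing characters is an assumed policy that may not suit every application.

```java
// Hypothetical helper that removes code points XML 1.0 cannot represent.
class Xml10Sanitizer {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Return a copy of the input without the code points rejected above.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
                .filter(Xml10Sanitizer::isXml10Char)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0001 and U+0007 are control characters; tab and n-with-tilde are allowed.
        String cleaned = strip("a\u0001b\u0007c\td\u00F1");
        System.out.println(cleaned.equals("abc\td\u00F1"));
    }
}
```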