Problem description
I am marshalling objects to an XML file using the "UTF-8" encoding. The file is generated successfully, but unmarshalling it back fails with an error.
The character is 0x1A (\u001a), which is valid in UTF-8 but illegal in XML. The JAXB Marshaller allows writing this character into the XML file, but the Unmarshaller cannot parse it back. I tried other encodings (UTF-16, ASCII, etc.) but still get the error.
The common solution is to remove or replace the invalid character before XML parsing. But if we need this character, how can we get the original character back after unmarshalling?
While looking for a solution, I decided to replace the invalid characters with a substitute character (for example a dot, ".") before unmarshalling.
I created this class:
public class InvalidXMLCharacterFilterReader extends FilterReader {

    public static final char substitute = '.';

    public InvalidXMLCharacterFilterReader(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int read = super.read(cbuf, off, len);
        if (read == -1)
            return -1;
        for (int readPos = off; readPos < off + read; readPos++) {
            if (!isValid(cbuf[readPos])) {
                cbuf[readPos] = substitute;
            }
        }
        return readPos - off + 1;
    }

    public boolean isValid(char c) {
        if ((c == 0x9)
                || (c == 0xA)
                || (c == 0xD)
                || ((c >= 0x20) && (c <= 0xD7FF))
                || ((c >= 0xE000) && (c <= 0xFFFD))
                || ((c >= 0x10000) && (c <= 0x10FFFF))) {
            return true;
        } else
            return false;
    }
}
This is how I read and unmarshal the file:
FileReader fileReader = new FileReader(this.getFile());
Reader reader = new InvalidXMLCharacterFilterReader(fileReader);
Object o = (Object)um.unmarshal(reader);
Somehow the reader does not replace the invalid characters with the character I want. The result is malformed XML data that cannot be unmarshalled. Is there something wrong with my InvalidXMLCharacterFilterReader class?
Recommended answer
The Unicode character U+001A is illegal in XML 1.0.
The encoding used to represent it does not matter in this case; the character is simply not allowed in XML content.
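To illustrate that the encoding is not the deciding factor, here is a minimal sketch that feeds the same document to the JDK's default DOM parser as UTF-8 and as UTF-16, once with a raw U+001A in the element content and once without. The class name Xml10ControlCharDemo and its helpers are made up for this example; only standard JAXP calls are used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question or answer.
class Xml10ControlCharDemo {
    static final char SUB = (char) 0x1A;

    // Builds a small XML 1.0 document that declares the given encoding.
    static String doc(String encodingName, String body) {
        return "<?xml version=\"1.0\" encoding=\"" + encodingName + "\"?><a>" + body + "</a>";
    }

    // Returns true if the parser accepts the document, false if it reports it as not well-formed.
    static boolean parses(String xml, Charset charset) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(charset)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses(doc("UTF-8", "xy"), StandardCharsets.UTF_8));         // true
        System.out.println(parses(doc("UTF-8", "x" + SUB + "y"), StandardCharsets.UTF_8));   // false
        System.out.println(parses(doc("UTF-16", "xy"), StandardCharsets.UTF_16));       // true
        System.out.println(parses(doc("UTF-16", "x" + SUB + "y"), StandardCharsets.UTF_16)); // false
    }
}
```

Both encodings can represent U+001A without trouble; it is the XML 1.0 character rules that reject it once the bytes have been decoded.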
XML 1.1 allows some of the restricted characters (including U+001A) to be included, but they must be present as numeric character references (&#x1a;).
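As a rough check of the XML 1.1 behaviour, the sketch below parses the same element with the character written as a numeric character reference, once under an XML 1.1 declaration and once under XML 1.0. It assumes the JDK's bundled parser, which supports XML 1.1; the class name Xml11CharRefDemo and the helper textOf are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question or answer.
class Xml11CharRefDemo {

    // Returns the text of the root element, or null if the parser rejects the document.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            return document.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String v11 = "<?xml version=\"1.1\"?><a>x&#x1a;y</a>";
        String v10 = "<?xml version=\"1.0\"?><a>x&#x1a;y</a>";

        String text = textOf(v11);
        // Under XML 1.1 the reference resolves to the original control character.
        System.out.println(text != null && text.charAt(1) == (char) 0x1A); // true
        // Under XML 1.0 even the escaped form is not well-formed.
        System.out.println(textOf(v10) == null); // true
    }
}
```

Note that this only shows what the parser accepts. Whether a given JAXB Marshaller can be made to emit an XML 1.1 declaration and escape the character this way is a separate question that the sketch does not cover.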
Source: Wikipedia.
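As for the filter class in the question, one likely culprit is its return statement. readPos is declared inside the for loop, so return readPos - off + 1 does not compile as posted, and if the variable were hoisted out of the loop the expression would evaluate to read + 1, telling the caller that one more character was read than the wrapped reader actually delivered. The following is a minimal corrected sketch, not a definitive implementation. The class name XmlSafeFilterReader and the helper filter are made up for this example, and passing surrogate code units through unchanged is an assumption added here so that characters above U+FFFF are not replaced.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical corrected variant of the reader from the question.
class XmlSafeFilterReader extends FilterReader {
    static final char SUBSTITUTE = '.';

    XmlSafeFilterReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        // FilterReader.read() delegates directly to the wrapped reader,
        // so the single-character path has to be filtered as well.
        int c = super.read();
        return (c == -1 || isValid((char) c)) ? c : SUBSTITUTE;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int read = super.read(cbuf, off, len);
        if (read == -1) {
            return -1;
        }
        for (int pos = off; pos < off + read; pos++) {
            if (!isValid(cbuf[pos])) {
                cbuf[pos] = SUBSTITUTE;
            }
        }
        // Report exactly the number of characters the wrapped reader delivered.
        return read;
    }

    // XML 1.0 character ranges, applied per UTF-16 code unit.
    // Assumption: surrogates are passed through without checking that they are paired.
    static boolean isValid(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || Character.isSurrogate(c);
    }

    // Convenience helper for the demo: runs a whole string through the filter.
    static String filter(String input) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new XmlSafeFilterReader(new StringReader(input))) {
            char[] buf = new char[4]; // small buffer to exercise several read calls
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "<a>x" + (char) 0x1A + "y</a>";
        System.out.println(filter(raw)); // <a>x.y</a>
    }
}
```

A reader like this can be passed to um.unmarshal(reader) in place of the original filter. The substitution is lossy, so the original U+001A cannot be recovered afterwards.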