Problem description
So, I am asking as a last resort, as I am completely out of ideas.
I have a Windows ASP.NET ASMX web services app that returns a serialized Person object with name, address, email, etc.,
but some attributes in the XML are encoded very weirdly, for instance &#x1A;
(I don't know where the encoding takes place; I assume in the serialization process).
Googling those characters, I see that it is "Windows-1252" encoding.
The problem occurs during parsing of the XML: I get a parse error of "invalid unicode character" at the position of the 1252 encoding.
How can I successfully parse it? What solutions do you suggest?
The parser is correct; whatever produced the serialisation is wrong. As with most of the C0/C1 control characters, it is invalid (actually, worse than that: not well-formed) to put a U+001A SUBSTITUTE into an XML 1.0 file(*), even if encoded as a character reference such as &#x1A;.
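To see the rejection in practice, here is a minimal sketch in Python (the question's app is ASP.NET; Python's standard-library parser is used here only for illustration). The helper name try_parse and the sample documents are made up for this example.

```python
import xml.etree.ElementTree as ET

def try_parse(xml_text):
    """Return the root element's text, or a string describing the parse error."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError as exc:
        return f"ParseError: {exc}"

# U+001A as a character reference: not well-formed in XML 1.0.
bad = "<name>Smith&#x1A;</name>"
# U+00E9 as a character reference: perfectly fine.
good = "<name>Smith&#xE9;</name>"

print(try_parse(bad))
print(try_parse(good))
```

The first call reports a parse error and the second returns the decoded text, which matches the behaviour described above: the parser is doing its job.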
No XML parser will read this, nor should it. Whilst you could put in some horrific hack to try to filter out &#x1A; sequences before passing them to the parser, such crude hacks wouldn't work for the general case. The serialiser should be fixed to stop producing them.
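For completeness, this is roughly what such a filter could look like as a stop-gap, sketched in Python. The function names are hypothetical, and the limitation is real: it works on raw text, so it would also rewrite a literal "&#x1A;" inside a CDATA section, where it is ordinary text rather than a character reference. It also does nothing about forbidden characters that appear raw rather than as references.

```python
import re
import xml.etree.ElementTree as ET

# Numeric character references: hexadecimal (&#x1A;) or decimal (&#26;).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def _is_xml10_char(cp):
    """True if the code point matches the XML 1.0 'Char' production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def drop_illegal_char_refs(xml_text):
    """Remove character references that XML 1.0 does not allow; keep the rest."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if _is_xml10_char(cp) else ""
    return _CHAR_REF.sub(repl, xml_text)

raw = "<name>Jo&#x1A;hn &#233; &#26;</name>"
cleaned = drop_illegal_char_refs(raw)
root = ET.fromstring(cleaned)  # parses now that the illegal references are gone
```

Treat this as a temporary workaround on the consuming side only; it hides the symptom and silently discards data.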
Actually I have no idea how the character (often used to mark end-of-file in ancient horrible operating systems) would get into the dataset used by an ASP.NET app, but it wouldn't seem to play any valid role in a name, address or e-mail. Perhaps what you really need is to look at cleaning your data.
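Cleaning the data before it is serialised might look something like the following Python sketch. The character class follows the XML 1.0 definition of a legal character; the function names and the sample Person record are assumptions made up for this example, not part of the original app.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 'Char' production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_illegal_xml_chars(value):
    """Drop characters that cannot appear in an XML 1.0 document."""
    return _ILLEGAL_XML10.sub("", value)

def person_to_xml(person):
    """Serialise a dict of string fields, cleaning each value first (hypothetical helper)."""
    root = ET.Element("Person")
    for field, value in person.items():
        ET.SubElement(root, field).text = strip_illegal_xml_chars(value)
    return ET.tostring(root, encoding="unicode")

# Sample record with a stray U+001A at the end of the name.
person = {"name": "Smith\x1a", "address": "1 Main St", "email": "smith@example.com"}
xml_text = person_to_xml(person)
parsed = ET.fromstring(xml_text)  # round-trips because the value was cleaned
```

Doing this where the data enters the system, rather than at serialisation time, would also keep the stray characters out of the database in the first place.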
(*: It would be legal if encoded as a character reference in an XML 1.1 document. If you absolutely must round-trip control characters through XML, you will have to use XML 1.1, though that may lead to compatibility issues with older XML parsers. You still can't use the U+0000 NULL character, so you're never going to be completely binary-safe.)