




I'm trying to parse an XML file using DocumentBuilderFactory as follows:

DocumentBuilderFactory ndsParserFactory = DocumentBuilderFactory.newInstance( );
ndsParserFactory.setNamespaceAware( true );
DocumentBuilder ndsParser = ndsParserFactory.newDocumentBuilder( );
Document ndsDocument = ndsParser.parse( ndsFileInputStream );


where ndsFileInputStream is an InputStream wrapping the file containing the XML.


I get an exception when the file contains a Unicode character such as Δ. When I strip out the line containing the offending character, the parsing works just fine.



I'm wondering if I'm neglecting to configure the DocumentBuilderFactory (or DocumentBuilder) instance properly in order to handle the Δ character.



Full disclosure: This is Android, and I'm including XML files (with an NDS file extension) as assets in my Android app. I access them via the AssetManager, which has a handy-dandy method for opening an asset file into an InputStream, which I then pass to the parse method of my DocumentBuilder. – d weld 16 hours ago


I noticed that the assets folder uses CP1252 as the default encoding for its contents, so I changed that to UTF-8. No luck. Then I removed the BOM from one of the NDS files (per link) and tried again. No luck. I'm thinking that the APK file (which is compressed like a ZIP file) is somehow mangling the non-ASCII XML. I think I'll have to resort to getting the NDS files onto the Android device by other means...



Are you sure the file is really written as UTF-8? Obviously you can open it in some editor and it shows the text correctly, but the editor could just be making a good guess at the encoding.
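
One way to check this without relying on an editor's guess is to decode the raw bytes strictly and see whether they are well-formed UTF-8. The following is only a sketch: the class name Utf8Check and the method isValidUtf8 are made up for illustration, and how you read the asset into a byte array is left to you. It also assumes StandardCharsets is available (on Android that means API level 19 or later).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
public class Utf8Check {

    /** Returns true only if the bytes form a well-formed UTF-8 sequence. */
    public static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Δ (U+0394) correctly encoded as UTF-8.
        byte[] utf8Delta = "\u0394".getBytes(StandardCharsets.UTF_8);
        // A lone 0xC4 byte, which is how ISO-8859-7 stores Δ; not valid UTF-8.
        byte[] greekDelta = { (byte) 0xC4 };

        System.out.println(isValidUtf8(utf8Delta));  // true
        System.out.println(isValidUtf8(greekDelta)); // false
    }
}
```

If the check fails on the bytes you get from the asset InputStream, the file is not UTF-8 by the time it reaches the parser, whatever the editor displays.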


The other thing to remember is that every character in a UTF-8 document is a Unicode character; the parser is simply choking when it hits a byte sequence that isn't valid in the declared encoding. UTF-8 is forgiving in one respect: any character in the 7-bit ASCII set is encoded exactly as it would be in plain ASCII, and a lot of XML is made up of nothing but plain ASCII characters. This catches people out: as soon as something non-ASCII comes up, defects in the text encoding paths through a system suddenly become apparent.
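
To make that concrete, here is a small sketch (the class and method names are invented for the example) showing that an ASCII character has identical bytes in US-ASCII and UTF-8, while Δ becomes a two-byte sequence that turns into two unrelated characters if some step in the pipeline reads it as a single-byte encoding such as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only; names are hypothetical.
public class AsciiVsUtf8 {

    /** Encodes a string as UTF-8 bytes. */
    public static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes as ISO-8859-1, simulating a step that assumes the wrong encoding. */
    public static String misreadAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Plain ASCII: the UTF-8 bytes equal the US-ASCII bytes.
        byte[] asciiA = "A".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.equals(asciiA, utf8("A"))); // true

        // Δ (U+0394) needs two bytes in UTF-8: 0xCE 0x94.
        byte[] delta = utf8("\u0394");
        System.out.println(delta.length); // 2
        System.out.printf("%02X %02X%n", delta[0] & 0xFF, delta[1] & 0xFF); // CE 94

        // Read with the wrong charset, the single Δ becomes two characters.
        System.out.println(misreadAsLatin1(delta).length()); // 2
    }
}
```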


You could try editing the XML declaration to see whether the file parses OK under another character encoding. ISO-8859-7 contains the Δ symbol; could the file be encoded in that?
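
If editing the declaration in the file is inconvenient, a similar experiment can be run from code by decoding the stream yourself and handing the parser a Reader via InputSource, so the charset you choose is the one that gets used. This is a rough sketch, not a fix: ParseWithCharset and its parse method are hypothetical names, the factory setup mirrors the one in the question, and it assumes the runtime supports the "ISO-8859-7" charset name.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for trying different encodings on the same stream.
public class ParseWithCharset {

    /** Parses the stream after decoding it with the named charset. */
    public static Document parse(InputStream in, String charsetName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // The Reader does the byte-to-character decoding, so the parser
        // receives characters rather than raw bytes.
        Reader reader = new InputStreamReader(in, Charset.forName(charsetName));
        return builder.parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // Sample document whose bytes are ISO-8859-7 (Δ is the single byte 0xC4).
        byte[] xml = "<root>\u0394</root>".getBytes(Charset.forName("ISO-8859-7"));
        Document doc = parse(new ByteArrayInputStream(xml), "ISO-8859-7");
        String text = doc.getDocumentElement().getTextContent();
        System.out.println(Integer.toHexString(text.codePointAt(0))); // 394
    }
}
```

Calling parse(ndsFileInputStream, "UTF-8") and then parse with "ISO-8859-7" on a fresh stream would show which encoding, if either, yields the expected Δ.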



07-29 14:04