Problem Description
In Python 2.7, when passing a unicode string to ElementTree's fromstring() method that has encoding="UTF-16" in the XML declaration, I'm getting a ParseError saying that the encoding specified is incorrect:
>>> from xml.etree import ElementTree
>>> data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
>>> ElementTree.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: encoding specified in XML declaration is incorrect: line 1, column 30
What does that mean? What makes ElementTree think so?
After all, I'm passing in unicode codepoints, not a byte string. There is no encoding involved here. How can it be incorrect?
Of course, one could argue that any encoding is incorrect, as these unicode codepoints are not encoded. However, then why is UTF-8 not rejected as "incorrect encoding"?
>>> ElementTree.fromstring(u'<?xml version="1.0" encoding="utf-8"?><root/>')
I can work around this by encoding the unicode string to a UTF-16 encoded byte string and passing that to fromstring(), or by replacing encoding="utf-16" with encoding="utf-8" in the unicode string, but I would like to understand why the exception is raised.
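For reference, here is a minimal sketch of both workarounds from the question (the .encode('utf-16-be') call matches the answer below; the .replace() call is just my own way of illustrating the second workaround):

from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-16"?><root/>'

# Workaround 1: hand the parser UTF-16 bytes that match the declaration
print ElementTree.fromstring(data.encode('utf-16-be'))

# Workaround 2: change the declaration in the unicode string to utf-8
print ElementTree.fromstring(data.replace(u'utf-16', u'utf-8'))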
Recommended Answer
In short: the XML parser that Python uses, expat, operates on bytes, not unicode characters. You MUST call .encode('utf-16-be') or .encode('utf-16-le') on the string before you pass it to ElementTree.fromstring:
ElementTree.fromstring(data.encode('utf-16-be'))
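As a quick check of my own (not part of the original answer), either byte order works, since expat detects the UTF-16 byte order from the first bytes of the document:

from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
for codec in ('utf-16-be', 'utf-16-le'):
    print codec, ElementTree.fromstring(data.encode(codec)).tag  # both print 'root'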
Proof: ElementTree.fromstring eventually calls down into pyexpat.xmlparser.Parse, which is implemented in pyexpat.c:
static PyObject *
xmlparse_Parse(xmlparseobject *self, PyObject *args)
{
    char *s;
    int slen;
    int isFinal = 0;

    if (!PyArg_ParseTuple(args, "s#|i:Parse", &s, &slen, &isFinal))
        return NULL;
    return get_parse_result(self, XML_Parse(self->itself, s, slen, isFinal));
}
So the unicode parameter you passed in gets converted using s#. The docs for PyArg_ParseTuple say that for s (and its s# variant), Unicode objects are converted to C strings using the default encoding, and a UnicodeError is raised if that conversion fails. In Python 2, the default encoding is ASCII.
Check it out:
from xml.etree import ElementTree
data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
print ElementTree.fromstring(data)
gives the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2163' in position 44: ordinal not in range(128)
which means that when you were specifying encoding="utf-8", you were just getting lucky that there weren't non-ASCII characters in your input when the Unicode string got encoded to ASCII. If you add the following before you parse, UTF-8 works as expected with that example:
import sys
reload(sys).setdefaultencoding('utf8')
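Putting those pieces together, a standalone Python 2 sketch of the utf-8 case (my own consolidation of the snippets above, not from the original answer):

import sys
reload(sys).setdefaultencoding('utf8')  # must run before the parse

from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
print ElementTree.fromstring(data)  # now parses instead of raising UnicodeEncodeError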
However, it doesn't work to set the default encoding to 'utf-16-be' or 'utf-16-le', since the Python bits of ElementTree do direct string comparisons, which start to fail in UTF-16 land.
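To illustrate the kind of mismatch that means (my own example, not from the original answer): under a UTF-16 default encoding, the plain ASCII byte-string literals that the ElementTree source compares against no longer match the unicode text:

import sys
reload(sys).setdefaultencoding('utf-16-be')

print u'<?xml' == '<?xml'                  # False -- Python 2 emits a UnicodeWarning and treats them as unequal
print repr(u'<?xml'.encode('utf-16-be'))   # '\x00<\x00?\x00x\x00m\x00l', not the ASCII bytes '<?xml'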