Why does ElementTree reject UTF-16 XML declarations with "encoding incorrect"?


Problem description

In Python 2.7, when I pass ElementTree's fromstring() method a unicode string that has encoding="UTF-16" in its XML declaration, I get a ParseError saying that the specified encoding is incorrect:

>>> from xml.etree import ElementTree
>>> data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
>>> ElementTree.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: encoding specified in XML declaration is incorrect: line 1, column 30

What does that mean? What makes ElementTree think so?

After all, I'm passing in unicode code points, not a byte string. There is no encoding involved here. How can it be incorrect?

Of course, one could argue that any encoding is incorrect, as these unicode code points are not encoded. But then why is UTF-8 not rejected as an "incorrect encoding"?

>>> ElementTree.fromstring(u'<?xml version="1.0" encoding="utf-8"?><root/>')

I could easily work around the problem, either by encoding the unicode string into a UTF-16-encoded byte string and passing that to fromstring(), or by replacing encoding="utf-16" with encoding="utf-8" in the unicode string, but I would like to understand why the exception is raised.

Recommended answer

In short: the XML parser that Python uses, expat, operates on bytes, not unicode characters. You must call .encode('utf-16-be') or .encode('utf-16-le') on the string before you pass it to ElementTree.fromstring:

ElementTree.fromstring(data.encode('utf-16-be'))
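As a minimal sketch of that workaround, the call above can be wrapped in a small helper. The name parse_utf16_text is hypothetical and only used for illustration; it is not part of ElementTree. The sample document also contains a non-ASCII character to show that the content survives the round trip. The code is written so that it should run unchanged on Python 2.7 and Python 3.

```python
from xml.etree import ElementTree


def parse_utf16_text(text):
    """Encode a unicode XML document as UTF-16 bytes, then parse it.

    Hypothetical helper for illustration only.
    """
    # expat works on bytes, so hand it two-byte-per-character data that
    # agrees with the encoding="utf-16" declaration inside the document.
    return ElementTree.fromstring(text.encode('utf-16-be'))


data = u'<?xml version="1.0" encoding="utf-16"?><root>\u2163</root>'
root = parse_utf16_text(data)
```

The important design point is that the bytes passed to the parser and the encoding named in the declaration agree with each other.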






Proof: ElementTree.fromstring eventually calls down into pyexpat.xmlparser.Parse, which is implemented in pyexpat.c:

static PyObject *
xmlparse_Parse(xmlparseobject *self, PyObject *args)
{
    char *s;
    int slen;
    int isFinal = 0;

    if (!PyArg_ParseTuple(args, "s#|i:Parse", &s, &slen, &isFinal))
        return NULL;

    return get_parse_result(self, XML_Parse(self->itself, s, slen, isFinal));
}

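So the unicode parameter you passed in gets converted using s#. According to the docs for PyArg_ParseTuple, s# hands a unicode object to the C code as its default-encoded byte string, if such a conversion is possible. In Python 2.7 the default encoding is ASCII unless it has been changed.

Check it out: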

from xml.etree import ElementTree
data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
print ElementTree.fromstring(data)

gives the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2163' in position 44: ordinal not in range(128)
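The implicit conversion behind this error can be reproduced explicitly. The following is only a sketch: to_default_encoded_bytes is a hypothetical helper that imitates the conversion step, assuming the default encoding is ASCII as it normally is in Python 2.7. It does not call into pyexpat itself.

```python
def to_default_encoded_bytes(text, default_encoding='ascii'):
    """Imitate the implicit unicode-to-bytes step (illustrative helper)."""
    return text.encode(default_encoding)


ascii_only = u'<?xml version="1.0" encoding="utf-8"?><root/>'
non_ascii = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'

# ASCII-only text converts without complaint.
converted = to_default_encoded_bytes(ascii_only)

# Text with a non-ASCII character cannot be represented in ASCII.
try:
    to_default_encoded_bytes(non_ascii)
    failed = False
except UnicodeEncodeError:
    failed = True
```

For ASCII-only input, the resulting bytes are identical to what a UTF-8 encoder would produce, so a declaration of encoding="utf-8" happens to match them.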

This means that when you specified encoding="utf-8", you were just getting lucky: there were no non-ASCII characters in your input when the unicode string got encoded to ASCII. If you add the following before you parse, UTF-8 works as expected with that example:

import sys
reload(sys).setdefaultencoding('utf8')

However, setting the default encoding to 'utf-16-be' or 'utf-16-le' does not work, since the Python parts of ElementTree do direct string comparisons, which start to fail in UTF-16 land.
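An alternative that avoids touching the process-wide default encoding is to encode explicitly with the codec named in the declaration. This is a sketch of that idea, not something ElementTree requires. The second half illustrates, under the assumption that the comparisons in question are made against plain one-byte-per-character strings, why UTF-16 bytes would not match them.

```python
from xml.etree import ElementTree

# Explicit UTF-8 encoding matches the encoding="utf-8" declaration,
# so the non-ASCII character is parsed without changing any defaults.
data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
root = ElementTree.fromstring(data.encode('utf-8'))

# The same short text looks very different as UTF-16 bytes: every
# ASCII character is followed by a NUL byte, so a byte-wise comparison
# with a plain string such as 'root' no longer succeeds.
tag = u'root'
utf8_bytes = tag.encode('utf-8')
utf16_bytes = tag.encode('utf-16-le')
```

Encoding at the call site keeps the choice of codec visible next to the parse call, rather than hidden in interpreter-wide state.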


08-20 20:39