Problem Description
In Python 2.7, when passing a unicode string to ElementTree's fromstring() method that has encoding="UTF-16" in the XML declaration, I'm getting a ParseError saying that the encoding specified is incorrect:
>>> from xml.etree import ElementTree
>>> data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
>>> ElementTree.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: encoding specified in XML declaration is incorrect: line 1, column 30
What does that mean? What makes ElementTree think so?
After all, I'm passing in unicode codepoints, not a byte string. There is no encoding involved here. How can it be incorrect?
Of course, one could argue that any encoding is incorrect, as these unicode codepoints are not encoded. However, then why is UTF-8 not rejected as "incorrect encoding"?
>>> ElementTree.fromstring(u'<?xml version="1.0" encoding="utf-8"?><root/>')
I can work around this by encoding the unicode string to a UTF-16 encoded byte string and passing that to fromstring(), or by replacing encoding="utf-16" with encoding="utf-8" in the unicode string, but I would like to understand why the exception is raised.
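For reference, here is a minimal sketch of both workarounds from the question (the .encode('utf-16-be') call matches the answer below; the .replace() call is just my own way of illustrating the second workaround):

from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-16"?><root/>'

# Workaround 1: hand the parser UTF-16 bytes that match the declaration
print ElementTree.fromstring(data.encode('utf-16-be'))

# Workaround 2: change the declaration in the unicode string to utf-8
print ElementTree.fromstring(data.replace(u'utf-16', u'utf-8'))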
Recommended Answer
In short: the XML parser that Python uses, expat, operates on bytes, not unicode characters. You MUST call .encode('utf-16-be') or .encode('utf-16-le') on the string before you pass it to ElementTree.fromstring:
ElementTree.fromstring(data.encode('utf-16-be'))
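As a quick check of my own (not part of the original answer), either byte order works, since expat detects the UTF-16 byte order from the first bytes of the document:

from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
for codec in ('utf-16-be', 'utf-16-le'):
    print codec, ElementTree.fromstring(data.encode(codec)).tag  # both print 'root'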
Proof: ElementTree.fromstring eventually calls down into pyexpat.xmlparser.Parse, which is implemented in pyexpat.c:
static PyObject *
xmlparse_Parse(xmlparseobject *self, PyObject *args)
{
    char *s;
    int slen;
    int isFinal = 0;

    if (!PyArg_ParseTuple(args, "s#|i:Parse", &s, &slen, &isFinal))
        return NULL;
    return get_parse_result(self, XML_Parse(self->itself, s, slen, isFinal));
}
So the unicode parameter you passed in gets converted using s#. The docs for PyArg_ParseTuple say that for s (and its s# variant), Unicode objects are converted to C strings using the default encoding, and a UnicodeError is raised if that conversion fails. In Python 2, the default encoding is ASCII.
Check it out:
from xml.etree import ElementTree
data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
print ElementTree.fromstring(data)
gives the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2163' in position 44: ordinal not in range(128)
which means that when you were specifying encoding="utf-8", you were just getting lucky that there weren't non-ASCII characters in your input when the Unicode string got encoded to ASCII. If you add the following before you parse, UTF-8 works as expected with that example:
import sys
reload(sys).setdefaultencoding('utf8')
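Putting those pieces together, a standalone Python 2 sketch of the utf-8 case (my own consolidation of the snippets above, not from the original answer):

import sys
reload(sys).setdefaultencoding('utf8')  # must run before the parse

from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
print ElementTree.fromstring(data)  # now parses instead of raising UnicodeEncodeError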
However, it doesn't work to set the default encoding to 'utf-16-be' or 'utf-16-le', since the Python bits of ElementTree do direct string comparisons, which start to fail in UTF-16 land.
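To illustrate the kind of mismatch that means (my own example, not from the original answer): under a UTF-16 default encoding, the plain ASCII byte-string literals that the ElementTree source compares against no longer match the unicode text:

import sys
reload(sys).setdefaultencoding('utf-16-be')

print u'<?xml' == '<?xml'                  # False -- Python 2 emits a UnicodeWarning and treats them as unequal
print repr(u'<?xml'.encode('utf-16-be'))   # '\x00<\x00?\x00x\x00m\x00l', not the ASCII bytes '<?xml'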