Python ElementTree won't convert non-breaking spaces when outputting UTF-8


Problem description


I'm trying to parse, manipulate, and output HTML using Python's ElementTree:

import sys
from cStringIO  import StringIO
from xml.etree  import ElementTree as ET
from htmlentitydefs import entitydefs

source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")

parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity.update(entitydefs)
etree = ET.ElementTree()

tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')

When I run this using Python 2.7 on Mac OS X 10.6, I get:

<p>Less than &lt;</p>

Traceback (most recent call last):
  File "bar.py", line 20, in <module>
    print ET.tostring(p, encoding='utf-8')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1120, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 931, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1067, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 19: ordinal not in range(128)

I thought that specifying "encoding='UTF-8'" would take care of the non-breaking space character, but apparently it doesn't. What should I do instead?

Solution

The byte 0xA0 is the Latin-1 encoding of the non-breaking space, not a unicode character. The value of p.text in the loop is a str rather than a unicode object, so before it can be encoded as UTF-8, Python must first implicitly convert it to a unicode string (i.e. decode it). Since it wasn't told which encoding to use, it assumes ASCII. 0xA0 is not a valid ASCII byte, although it is a valid Latin-1 character, which is why a call to encode ends in a UnicodeDecodeError.
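The decode-then-encode step described above can be reproduced in isolation. This is a minimal sketch, not part of the original answer: the variable names are made up for illustration, and the decode that Python 2 performs implicitly is written out explicitly here so that the snippet behaves the same on Python 2.7 and Python 3.

```python
# The single byte that Python 2's entitydefs stores for &nbsp; (Latin-1 encoded).
raw = b"\xa0"

# Decoding it as ASCII fails; Python 2 attempts this decode implicitly
# when encode() is called on a byte str.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Decoding it as Latin-1 yields the code point U+00A0 ...
nbsp = raw.decode("latin-1")

# ... and that unicode value can be encoded as UTF-8 without trouble.
utf8_bytes = nbsp.encode("utf-8")

print(ascii_error)
print(repr(utf8_bytes))
```

The first print shows the same kind of "'ascii' codec can't decode byte 0xa0" message as the traceback, and the second shows the two-byte UTF-8 form of the non-breaking space.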

You end up with Latin-1 characters instead of unicode characters because entitydefs maps entity names to Latin-1 encoded strings. What you need are the unicode code points, which you can get from htmlentitydefs.name2codepoint.
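As a rough illustration of that mapping, the sketch below builds a name-to-character table from name2codepoint. The try/except fallbacks and the name unicode_entities are assumptions added so the snippet also runs on Python 3, where the module is called html.entities and unichr is spelled chr; the original answer targets Python 2.7 only.

```python
try:
    from htmlentitydefs import name2codepoint    # Python 2
except ImportError:
    from html.entities import name2codepoint     # Python 3 (assumed fallback)

try:
    unichr                                       # exists on Python 2
except NameError:
    unichr = chr                                 # Python 3 equivalent

# Map every entity name to its unicode character rather than to a byte string.
unicode_entities = dict((name, unichr(cp)) for name, cp in name2codepoint.items())

print(name2codepoint["nbsp"])
print(repr(unicode_entities["nbsp"].encode("utf-8")))
```

A table built this way holds unicode values, so text taken from it can be encoded as UTF-8 without an implicit ASCII decode.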

The version below should fix it for you:

import sys
from cStringIO  import StringIO
from xml.etree  import ElementTree as ET
from htmlentitydefs import name2codepoint

source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")

parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
# Map each entity name to its unicode character, not to a Latin-1 byte string.
parser.entity.update((x, unichr(i)) for x, i in name2codepoint.iteritems())
etree = ET.ElementTree()

tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')
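To see why unicode entity values are enough, the serialization step can be checked on its own, without the parser. This is only a sketch under the assumption that the element text is already unicode, which is what the corrected entity table produces; the element is built by hand here purely for illustration.

```python
from xml.etree import ElementTree as ET

# Build a <p> element whose text is unicode and contains U+00A0.
p = ET.Element("p")
p.text = u"Non-breaking space \u00a0"

# With unicode text, serializing as UTF-8 succeeds and emits the bytes C2 A0.
out = ET.tostring(p, encoding="UTF-8")
print(repr(out))
```

Depending on the Python version, the output may start with an XML declaration, but it ends with the p element containing the UTF-8 encoded non-breaking space.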
