python-2.7 - Parsing an HTML page containing & using Python

I am trying to parse an HTML page in Python using urllib2 and ElementTree, but I am running into trouble parsing the HTML. The page contains "&" inside a quoted string, and ElementTree throws a ParseError for the lines containing &.

Script:

    import urllib2
    import xml.etree.ElementTree as ET

    url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
    req = urllib2.Request(url, headers={'Content-type': 'text/xml'})
    r = urllib2.urlopen(req).read()
    htmlpage = ET.fromstring(r)

This raises the following error in Python 2.7:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1624, in feed
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
    xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 676, column 73

The error corresponds to the following line:

    <input type="hidden" id="HdnFldAndamanNicobar" value="1,Andaman & Nicobar Islands;" />

It looks like the & is not being resolved to &amp; in the variable r when the HTML page is read. I also tried parsing the page in R with htmlTreeParse, and there the "&" is correctly converted to &amp;. Please let me know if I am missing something in urllib2.

Edit: I replaced "&" with &amp;, but line 904 contains:

    for (i = 0; i < strac.length - 1; i++) {

Best answer

First of all, xml.etree.ElementTree is an XML parser. It does not handle HTML or its entities out of the box. A bare & is illegal inside XML, which is why parsing fails.

Start using a real, specialized HTML parser, BeautifulSoup:

    >>> from urllib2 import urlopen
    >>> from bs4 import BeautifulSoup
    >>> url = 'http://eciresults.nic.in/ConstituencywiseU011.htm'
    >>> soup = BeautifulSoup(urlopen(url))
    >>> soup.find('td').text.strip()
    u'ELECTION COMMISSION OF INDIA'

See also: How to parse malformed HTML in python, using standard libraries

For "python-2.7 - Parsing an HTML page containing & using Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23707647/
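
To illustrate the answer's point that a bare & is not legal XML, here is a minimal sketch that feeds the input line quoted in the question to ElementTree twice: once unchanged, and once with the ampersand escaped. The helper name parse_attribute and the constant RAW are hypothetical, introduced only for this example. The code uses only xml.etree.ElementTree and no network access.

```python
import xml.etree.ElementTree as ET

# The input element quoted in the question, with its bare ampersand.
RAW = ('<input type="hidden" id="HdnFldAndamanNicobar" '
       'value="1,Andaman & Nicobar Islands;" />')


def parse_attribute(markup):
    """Return the 'value' attribute, or None if markup is not well-formed XML.

    Hypothetical helper for this illustration only.
    """
    try:
        return ET.fromstring(markup).get("value")
    except ET.ParseError:
        return None


# A bare "&" is not well-formed XML, so ElementTree rejects the line.
rejected = parse_attribute(RAW)

# With the ampersand escaped as "&amp;", the same line parses, and the
# parser decodes the entity back to "&" in the attribute value.
accepted = parse_attribute(RAW.replace("&", "&amp;"))

print(rejected)
print(accepted)
```

The blanket replace is shown only to make the failure visible. On a real page it would double-escape entities that are already correct, and it does nothing for other constructs that are valid in HTML but not in XML, such as the < inside the script at line 904 mentioned in the question's edit. That is why the answer recommends a lenient HTML parser instead.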