But a quick experiment seems to imply otherwise:

    Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:45:13) [MSC v.1600 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from lxml import etree
    >>> parser = etree.HTMLParser()
    >>> from urllib.request import urlopen
    >>> with urlopen('https://pypi.python.org/simple') as f:
    ...     tree = etree.parse(f, parser)
    ...
    >>> tree2 = etree.parse('https://pypi.python.org/simple', parser)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src\lxml\lxml.etree.c:72655)
      File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:106263)
      File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106564)
      File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105561)
      File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100456)
      File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94543)
      File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:96003)
      File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:95015)
    OSError: Error reading file 'https://pypi.python.org/simple': failed to load external entity "https://pypi.python.org/simple"
    >>>

I can use the urlopen method, but the documentation seems to imply that passing a URL is somehow better. Also, I'm a bit concerned about relying on lxml if the documentation is inaccurate, particularly if I start needing to do anything more complex.

What is the correct way to parse HTML with lxml from a known URL? And where should I be looking to see that documented?
Update: I get the same error if I use an http URL rather than an https one.

Solution

The issue is that lxml does not support HTTPS URLs, and http://pypi.python.org/simple redirects to an HTTPS version. So for any secure website, you need to read the URL yourself:

    from lxml import etree
    from urllib.request import urlopen

    parser = etree.HTMLParser()
    with urlopen('https://pypi.python.org/simple') as f:
        tree = etree.parse(f, parser)
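The approach in the answer can be wrapped in a small helper so that every caller fetches the page with urllib and only hands lxml an open response. This is a minimal sketch, not part of lxml: the function name `parse_html_url` and the `timeout` default are assumptions made for illustration.

```python
from urllib.request import urlopen

from lxml import etree


def parse_html_url(url, timeout=10.0):
    """Fetch `url` with urllib and parse the response body as HTML.

    Hypothetical helper: urllib performs the download (including HTTPS
    and redirects), so lxml never has to open the URL itself.
    """
    parser = etree.HTMLParser()
    with urlopen(url, timeout=timeout) as response:
        # Pass the final URL (after any redirect) as base_url so the tree
        # still records where the document came from.
        return etree.parse(response, parser, base_url=response.geturl())
```

Calling `parse_html_url('https://pypi.python.org/simple')` would then return an element tree in the same way as the `with urlopen(...)` block above, assuming the site is reachable.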