Question
I'm trying to parse some html in Python. There were some methods that actually worked before... but nowadays there's nothing I can actually use without workarounds.
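- BeautifulSoup has had problems since SGMLParser went away
- html5lib cannot parse half of what's "out there"
- lxml tries to be "too correct" for typical html (attributes and tags cannot contain unknown namespaces, or an exception is thrown, which means almost no page with Facebook connect can be parsed)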
What other options are there these days? (if they support xpath, that would be great)
Recommended answer
Make sure that you use the html module when you parse HTML with lxml:
>>> from lxml import html
>>> doc = """<html>
... <head>
... <title> Meh
... </head>
... <body>
... Look at this interesting use of <p>
... rather than using <br /> tags as line breaks <p>
... </body>"""
>>> html.document_fromstring(doc)
<Element html at ...>
All the errors & exceptions will melt away, and you'll be left with an amazingly fast parser that often deals with HTML soup better than BeautifulSoup.
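Since the question also asks about xpath, here is a minimal sketch of querying the tree that lxml.html recovers from sloppy markup. The sample markup and the names soup, hrefs and link_texts are made up for illustration and are not part of the original answer; only html.document_fromstring, xpath and text_content are lxml calls.

```python
from lxml import html

# Hypothetical sloppy markup: unclosed <p> and <li> tags, no </html>.
soup = """<html>
<body>
<p>First paragraph with no closing tag
<p>Second paragraph, also unclosed
<ul>
<li><a href="/one">one</a>
<li><a href="/two">two</a>
</ul>
</body>"""

# The HTML parser recovers a tree instead of raising on the missing tags.
tree = html.document_fromstring(soup)

# XPath queries run against the recovered tree.
hrefs = tree.xpath("//li/a/@href")
link_texts = [a.text_content() for a in tree.xpath("//a")]

print(hrefs)
print(link_texts)
```

The xpath method returns a list in both cases here: attribute values for the first query, elements for the second, so the usual list handling applies to the results.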