How to parse malformed HTML in Python
Question
I need to browse the DOM tree of a parsed HTML document.
I'm using uTidyLib to clean up the string before parsing it with lxml:
import tidy
from lxml import etree
a = tidy.parseString(html_code, options)
dom = etree.fromstring(str(a))
Sometimes I get an error; it seems that tidylib is not able to repair some malformed HTML.
How can I parse every HTML file without getting an error, parsing only the salvageable parts of files that cannot be repaired?
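For context, one likely source of the error is that `etree.fromstring` uses lxml's strict XML parser by default, so any markup that is not well-formed XML is rejected. The sketch below contrasts that with lxml's own lenient HTML parser. The input string `broken` and the helper `parse_leniently` are hypothetical names introduced for illustration, and some inputs (an empty string, for example) may still raise even with the lenient parser.

```python
from lxml import etree

# Hypothetical malformed input: </tr> and </td> are swapped,
# and </table> and </html> are missing.
broken = "<html><body><table><tr><td>hi</tr></td></body>"


def parse_leniently(markup):
    """Return the root element, letting libxml2 recover from bad markup."""
    # recover=True asks the HTML parser to keep going past errors
    # instead of raising on the first one.
    parser = etree.HTMLParser(recover=True)
    return etree.fromstring(markup, parser)


# The strict XML parser rejects the mismatched closing tags.
try:
    etree.fromstring(broken)
    strict_failed = False
except etree.XMLSyntaxError:
    strict_failed = True

# The lenient HTML parser still yields a tree that can be browsed.
root = parse_leniently(broken)
cell_text = root.findtext(".//td")
```

The recovered tree is an ordinary lxml element tree, so the usual `find`, `findtext`, and `iter` calls work on it.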
Solution
Beautiful Soup does a good job with invalid/broken HTML (this session uses the Python 2 era BeautifulSoup 3 package):
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<htm@)($*><body><table <tr><td>hi</tr></td></body><html")
>>> print soup.prettify()
<htm>
 <body>
  <table>
   <tr>
    <td>
     hi
    </td>
   </tr>
  </table>
 </body>
</htm>
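The session above relies on the `BeautifulSoup` module name and the `print` statement, which only work on Python 2. A rough Python 3 equivalent, assuming the `bs4` package (Beautiful Soup 4) is installed, might look like the sketch below. The variable names are illustrative, and the tree that `bs4` recovers depends on the parser backend, so its output will not necessarily match the one shown above.

```python
from bs4 import BeautifulSoup

# The malformed markup from the session above.
broken = "<htm@)($*><body><table <tr><td>hi</tr></td></body><html"

# "html.parser" is the backend from the Python standard library,
# so no additional parser package is assumed here.
soup = BeautifulSoup(broken, "html.parser")

# Browse the recovered tree: collect tag names in document order.
tag_names = [tag.name for tag in soup.find_all(True)]

# Look up the table cell that survived the repair, if any.
cell = soup.find("td")
cell_text = cell.get_text(strip=True) if cell is not None else None

print(soup.prettify())
```

If lxml is installed, passing `"lxml"` instead of `"html.parser"` selects lxml's HTML parser as the backend, which may repair the same input differently.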