How to parse malformed HTML in Python


Problem description

I need to browse the DOM tree of a parsed HTML document.

I'm using uTidyLib to clean up the string before parsing it with lxml:

import tidy
from lxml import etree
a = tidy.parseString(html_code, options)
dom = etree.fromstring(str(a))

Sometimes I get an error; it seems that tidylib is not able to repair malformed HTML.

How can I parse every HTML file without getting an error, parsing only the recoverable parts of files that cannot be fully repaired?

Solution

Beautiful Soup does a good job with invalid/broken HTML:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<htm@)($*><body><table <tr><td>hi</tr></td></body><html")
>>> print soup.prettify()
<htm>
 <body>
  <table>
   <tr>
    <td>
     hi
    </td>
   </tr>
  </table>
 </body>
</htm>
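
The session above uses the Python 2 era `BeautifulSoup` (version 3) import. As a minimal sketch of the same idea on Python 3, assuming the `beautifulsoup4` package is installed, a fragment with swapped closing tags and unclosed elements can still be turned into a tree that can be browsed. The `broken` string and the variable names are illustrative and not taken from the original answer.

```python
from bs4 import BeautifulSoup

# Illustrative malformed input (an assumption, not from the original post):
# </tr> and </td> are swapped, and <table> and <html> are never closed.
broken = "<html><body><table><tr><td>hi</tr></td></body>"

# "html.parser" selects the parser from the standard library. It does not
# raise on malformed markup, so Beautiful Soup builds a best-effort tree.
soup = BeautifulSoup(broken, "html.parser")

# Browse the resulting tree like any other DOM.
cell = soup.find("td")
cell_text = cell.get_text(strip=True)

# Walk upwards from the cell to see where it ended up in the tree.
parent_names = [parent.name for parent in cell.parents]

print(cell_text)
print(parent_names)
```

Beautiful Soup can also use lxml as its underlying parser by passing `"lxml"` instead of `"html.parser"` when lxml is installed, and lxml's own `lxml.html` module parses HTML leniently as well. Different parsers may repair the same broken document in different ways, so it is worth checking the result on real inputs.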
