BeautifulSoup sibling structure with br tags


Question

I'm trying to parse an HTML document using the BeautifulSoup Python library, but the structure is getting distorted by <br> tags. Here is an example.

Input HTML:

<div>
  some text <br>
  <span> some more text </span> <br>
  <span> and more text </span>
</div>

HTML as BeautifulSoup interprets it:

<div>
  some text
  <br>
    <span> some more text </span>
    <br>
      <span> and more text </span>
    </br>
  </br>
</div>

In the source, the spans can be considered siblings. After parsing (using the default parser), they are no longer siblings, because each <br> tag is treated as a container that wraps everything following it.

The only solution I can think of is to strip the <br> tags altogether before passing the HTML to BeautifulSoup, but that doesn't seem very elegant, as it requires me to change the input. What's a better way to solve this?

Recommended answer

Your best bet is to extract() the line breaks from the parsed tree. It's easier than you think :).

>>> from bs4 import BeautifulSoup as BS
>>> html = """<div>
...   some text <br>
...   <span> some more text </span> <br>
...   <span> and more text </span>
... </div>"""
>>> soup = BS(html, "lxml")  # parser named explicitly; lxml adds the <html>/<body> wrappers shown below
>>> for linebreak in soup.find_all('br'):
...     linebreak.extract()
...
<br/>
<br/>
>>> print(soup.prettify())
<html>
 <body>
  <div>
   some text
   <span>
    some more text
   </span>
   <span>
    and more text
   </span>
  </div>
 </body>
</html>
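
To confirm that the spans really are siblings once the line breaks are gone, a check along the following lines may help. This is only a sketch under a few assumptions that are not part of the answer above: it names Python's built-in "html.parser" explicitly, it uses decompose() instead of extract() (both remove the tag, but decompose() discards it rather than returning it), and the variable names first, second, spans_are_siblings and remaining_br are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """<div>
  some text <br>
  <span> some more text </span> <br>
  <span> and more text </span>
</div>"""

# Assumption: the built-in "html.parser" is used so the result does not
# depend on which optional parser happens to be installed.
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like snapshot, so removing tags while looping is safe.
# decompose() removes each <br> from the tree and returns nothing.
for linebreak in soup.find_all("br"):
    linebreak.decompose()

first, second = soup.find_all("span")

# find_next_sibling("span") skips the whitespace strings between the tags
# and returns the next <span> that shares the same parent.
spans_are_siblings = first.find_next_sibling("span") is second
remaining_br = len(soup.find_all("br"))

print(spans_are_siblings, remaining_br)  # prints: True 0
```

Because only the parsed tree is modified, the input HTML string itself stays untouched, which addresses the concern raised in the question.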

