Beautifulsoup 4: Remove comment tag and its content

Question

The page that I'm scraping contains the HTML below. How do I remove the comment tag <!-- --> along with its content using bs4?

<div class="foo">
cat dog sheep goat
<!--
<p>NewPP limit report
Preprocessor node count: 478/300000
Post‐expand include size: 4852/2097152 bytes
Template argument size: 870/2097152 bytes
Expensive parser function count: 2/100
ExtLoops count: 6/100
</p>
-->

</div>

Recommended answer

You can use extract() (the solution is based on this answer):

from bs4 import BeautifulSoup, Comment

data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

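# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, 'html.parser')

div = soup.find('div', class_='foo')
# Calling a tag is shorthand for find_all(); Comment is a NavigableString subclass.
for element in div(string=lambda text: isinstance(text, Comment)):
    element.extract()  # detach the comment node from the tree

print(soup.prettify())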

As a result, you get your div without comments:

<div class="foo">
    cat dog sheep goat
</div>
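
If the comments are scattered across the whole page rather than inside one known div, the same idea can be wrapped in a small helper. The following is only a sketch: the name strip_comments is hypothetical (it is not part of bs4), and it assumes the built-in 'html.parser' is acceptable for your markup.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html: str) -> str:
    """Return the markup with every <!-- ... --> comment node removed.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Comments are parsed as Comment objects (a NavigableString subclass),
    # so a string= filter with isinstance() selects exactly those nodes.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()  # detach the comment and everything inside it
    return str(soup)


cleaned = strip_comments('<div class="foo">cat <!-- <p>hidden</p> --> dog</div>')
print(cleaned)
```

Because the <p> inside the comment is never parsed as a tag, removing the Comment node removes that text as well; no separate handling of the nested markup is needed.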
