
Problem description

I'm trying to remove <script> tags, along with the content inside them, using BeautifulSoup. I went to the documentation, and it seems to be a really simple function to call. More information about the function is here. Here is the content of the HTML page that I have parsed so far...

<body class="pb-theme-normal pb-full-fluid">
    <div class="pub_300x250 pub_300x250m pub_728x90 text-ad textAd text_ad text_ads text-ads text-ad-links" id="wp-adb-c" style="width: 1px !important;
    height: 1px !important;
    position: absolute !important;
    left: -10000px !important;
    top: -1000px !important;
    ">
</div>
<div id="pb-f-a">
</div>
    <div class="" id="pb-root">
    <script>
    (function(a){
        TWP=window.TWP||{};
        TWP.Features=TWP.Features||{};
        TWP.Features.Page=TWP.Features.Page||{};
        TWP.Features.Page.PostRecommends={};
        TWP.Features.Page.PostRecommends.url="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/hybrid.json?callbackx3d?";
        TWP.Features.Page.PostRecommends.trackUrl="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/tracker.json?callbackx3d?";
        TWP.Features.Page.PostRecommends.profileUrl="https://usersegment.wpdigital.net/usersegments";
        TWP.Features.Page.PostRecommends.canonicalUrl=""
    })(jQuery);

    </script>
    </div>
</body>

Imagine you have some web content like that, stored in a BeautifulSoup object called soup_html. If I run soup_html.script.decompose() and then inspect soup_html, the script tags are still there. How can I get rid of the <script> tags and the content inside them?

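from bs4 import BeautifulSoup

markup = 'The html above'  # placeholder for the HTML shown above
soup = BeautifulSoup(markup, 'html.parser')  # name the parser explicitly
html_body = soup.body

soup.script.decompose()  # removes only the first <script> element

html_body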

Solution

This removes only a single script element from the soup: soup.script is shorthand for soup.find("script"), which returns just the first match. Instead, I think you meant to decompose all of them:

for script in soup("script"):
    script.decompose()
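
To make the difference concrete, here is a minimal, self-contained sketch. The markup is a hypothetical stand-in for the page in the question (two short <script> elements and one paragraph), and "html.parser" is assumed as the parser. It first calls decompose() on soup.script alone, then loops over every match.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page in the question: two <script> elements.
markup = """
<body>
  <div id="pb-root">
    <script>var a = 1;</script>
    <p>Visible text</p>
    <script>var b = 2;</script>
  </div>
</body>
"""

soup = BeautifulSoup(markup, "html.parser")

# soup.script is the first <script> only, so one element is still left.
soup.script.decompose()
after_single = len(soup.find_all("script"))

# find_all returns a list-like snapshot, so removing elements while
# iterating over it is safe.
for script in soup.find_all("script"):
    script.decompose()
after_loop = len(soup.find_all("script"))

# Only the non-script text remains in the tree.
text = soup.get_text(strip=True)
print(after_single, after_loop, text)
```

Calling soup("script") as in the answer is equivalent to soup.find_all("script"); the longer spelling is used here only for readability.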


08-03 22:02