BeautifulSoup: Grabbing Visible Webpage Text

Question

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want just the body text (the article), and maybe a few tab names here and there. I have tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments that I don't want. I can't figure out which arguments to pass to findAll() to get only the visible text on a webpage.

So, how should I find all visible text, excluding scripts, comments, CSS, etc.?

Recommended answer

Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


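def tag_visible(element):
    # Text nested directly inside these tags is never rendered as page content.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # HTML comments are parsed as Comment nodes and are not visible either.
    if isinstance(element, Comment):
        return False
    return True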


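def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # string=True matches every text node; it replaces the older text=True
    # spelling and requires bs4 4.4.0 or later.
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)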

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
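
To check the filtering idea without a network request, the same approach can be run against a small inline page. The sketch below is only an illustration, not part of the original answer: SAMPLE_HTML, HIDDEN_PARENTS, and visible_text are made-up names, and it assumes bs4 4.4.0 or later for the string=True argument. It also skips whitespace-only text nodes, which the answer above keeps, so the joined result has no runs of extra spaces.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html>
  <head>
    <title>Storm report</title>
    <style>body { color: black; }</style>
  </head>
  <body>
    <!-- tracking comment -->
    <h1>Snow hits the coast</h1>
    <p>Roads were <b>closed</b> overnight.</p>
    <script>var x = 1;</script>
  </body>
</html>
"""

# Same parent names as the answer's tag_visible check.
HIDDEN_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]'}


def visible_text(html):
    """Return the text nodes a reader would see, joined by single spaces."""
    soup = BeautifulSoup(html, 'html.parser')
    pieces = []
    for node in soup.find_all(string=True):
        # Skip text inside non-rendered tags, and skip HTML comments.
        if node.parent.name in HIDDEN_PARENTS or isinstance(node, Comment):
            continue
        stripped = node.strip()
        # Drop whitespace-only nodes (indentation between tags).
        if stripped:
            pieces.append(stripped)
    return ' '.join(pieces)


print(visible_text(SAMPLE_HTML))
```

On this sample the function prints "Snow hits the coast Roads were closed overnight.", with the title, stylesheet, script, and comment all left out. Note that this kind of filter only looks at tag names; text hidden through CSS (for example display: none) would still be returned.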
