Problem description
Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want the body text (the article), and maybe a few tab names here and there. I tried the suggestion in this SO question, but it returns lots of <script> tags and HTML comments, which I don't want. I can't figure out which arguments to pass to findAll() in order to get just the visible text on a webpage.
So, how should I find all visible text, excluding scripts, comments, CSS, etc.?
Recommended answer
Try this:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    # Text inside these tags is never rendered as page content.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # HTML comments are text nodes too, so filter them out explicitly.
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # string=True matches every text node in the document.
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)


html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
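
To check the idea without hitting a live URL, here is a minimal sketch of a slightly different approach: instead of filtering text nodes, it removes the non-rendered elements from the tree and then joins whatever text is left. The sample markup, the HIDDEN_TAGS tuple, and the visible_text() helper are all made up for illustration; they are not part of the original answer, and the tag list may need adjusting for a real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, used here instead of a live URL.
SAMPLE_HTML = """
<html>
  <head><title>Storm report</title><style>p { color: red; }</style></head>
  <body>
    <!-- tracking comment -->
    <script>var x = 1;</script>
    <h1>Headline</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

# Assumed list of elements whose contents are not rendered as page text.
HIDDEN_TAGS = ("script", "style", "head", "title", "meta")


def visible_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Remove hidden elements one at a time, re-searching after each removal
    # so that a tag nested inside an already removed one is never touched.
    tag = soup.find(HIDDEN_TAGS)
    while tag is not None:
        tag.decompose()
        tag = soup.find(HIDDEN_TAGS)
    # Comments are text nodes of type Comment; detach them from the tree.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # stripped_strings yields each remaining non-empty text node, trimmed.
    return " ".join(soup.stripped_strings)


print(visible_text(SAMPLE_HTML))
```

On this sample, the heading and paragraph text survive, while the title, the CSS rule, the script body, and the comment do not. Note that this variant modifies the parsed tree, so it fits best when the soup object is not needed for anything else afterwards.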