Problem description
I am reading text from HTML files and running some analysis on it. The .html files are news articles.
Code:
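import nltk
from unidecode import unidecode

html = open(filepath, 'r').read()
raw = nltk.clean_html(html)  # strips tags only; no longer implemented in NLTK 3.x
raw = unidecode(raw.decode('utf8'))  # Python 2: transliterate to ASCII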
Now I only want the article content, not the rest of the text such as advertisements and headings. How can I do this relatively accurately in Python?
I know of tools such as Jsoup (a Java API) and boilerpipe, but I want to do this in Python. I found some techniques using bs4, but they are limited to one type of page, and I have news pages from numerous sources. There is also a dearth of sample code.
I am looking for something exactly like this, but in Python: http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf
To help me understand, please write sample code that extracts the content of the following link: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general
Recommended answer
There are libraries for this in Python too :)
Since you mentioned Java, there is a Python wrapper for boilerpipe that lets you use it directly inside a Python script: https://github.com/misja/python-boilerpipe
If you want to use pure-Python libraries, there are two options:
https://github.com/buriy/python-readability
and
https://github.com/grangier/python-goose
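As a rough illustration of the first option, here is a minimal sketch, assuming Python 3 and the readability-lxml package (which provides the readability module). Document(html).summary() returns the main content as an HTML fragment, so the sketch adds a small standard-library tag stripper to turn it into plain text. The helper names html_to_text and readability_text are made up for this example and are not part of the library.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(fragment):
    """Hypothetical helper: one line per text node (inline tags also split lines)."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return "\n".join(collector.chunks)


def readability_text(html):
    """Hypothetical helper: return (title, plain text) for one page."""
    # Imported lazily so html_to_text works without the package installed.
    from readability import Document  # assumption: pip install readability-lxml

    doc = Document(html)
    return doc.title(), html_to_text(doc.summary())
```

Calling readability_text(open(filepath).read()) on a stored page should give the title and the article text, although how well the body is detected will vary from site to site.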
Of the two, I prefer Goose. Be aware, however, that recent versions sometimes fail to extract the text for some reason (my recommendation is to use version 1.0.22 for now).
Here is some sample code using Goose:
from goose import Goose
from requests import get

# Download the page, then hand the raw HTML to Goose
response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text  # the extracted article text
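Since the question is about pages already saved on disk, the same raw_html call can be fed from files instead of a download. The following is only a sketch, assuming Python 3: extract_folder is a hypothetical helper, and the extractor is passed in as a plain function so that Goose, readability, or anything else can be plugged in.

```python
from pathlib import Path


def extract_folder(folder, extract_text, encoding="utf-8"):
    """Apply extract_text(html) to every .html file directly inside folder.

    Hypothetical helper for illustration. Returns a dict that maps each
    file name to whatever extract_text returned for that file's contents.
    """
    results = {}
    for path in sorted(Path(folder).glob("*.html")):
        # errors="replace" keeps going when a page is not valid in this encoding.
        html = path.read_text(encoding=encoding, errors="replace")
        results[path.name] = extract_text(html)
    return results
```

With Goose, the function to pass would be something like lambda html: extractor.extract(raw_html=html).cleaned_text, reusing one Goose() instance for all files. Note that python-goose targets Python 2; under Python 3 the goose3 fork is, as far as I know, the package that offers the same call.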