This post covers how to read article content with Goose.

Question

I am trying to use Goose to read from .html files (a URL is specified here for convenience in the example) [1], but at times it doesn't show any text. Please help me with this issue.

Goose version used: https://github.com/agolo/python-goose/. The present version gives some errors.

from goose import Goose
from requests import get

# Fetch the page and pass the raw HTML to Goose.
response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)

# cleaned_text is sometimes empty for this page.
text = article.cleaned_text
print(text)

Recommended answer

Goose does use several predefined elements that are likely good starting points for finding the top node. If none of these "known" elements is found, it starts looking for the top_node, which is generally an element containing many p tags. See extractors/content.py for more details.

The given article lacks many traits of a typical article, which is normally wrapped in an article tag or in a div tag with a class or id such as 'post-content', 'story-body', or 'article'. Here the content sits in a div tag with id = 'docText' and contains no paragraphs, so Goose cannot make a good guess about it.
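To make the lookup order concrete, here is a toy sketch of the two stages described above: a known attribute match first, then a fallback to the element with the most direct p children. It is not Goose's implementation (the real logic lives in extractors/content.py and is likely more involved). The names TopNodeFinder, find_top_node, and DEFAULT_KNOWN, the sample page, and the shortened tag list are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical, shortened stand-in for Goose's list of known content tags.
DEFAULT_KNOWN = [
    {'attr': 'class', 'value': 'post-content'},
    {'attr': 'class', 'value': 'story-body'},
]

# Tags that never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {'br', 'img', 'hr', 'meta', 'link', 'input'}


class TopNodeFinder(HTMLParser):
    """Toy lookup: known attribute match first, else most direct <p> children."""

    def __init__(self, known_tags):
        super().__init__()
        self.known_tags = known_tags
        self.stack = []          # currently open elements
        self.known_match = None  # first element matching known_tags
        self.best = None         # element with the most direct <p> children

    def _matches_known(self, attrs):
        for known in self.known_tags:
            value = attrs.get(known['attr']) or ''
            if known['value'] in value.split():
                return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = {'tag': tag, 'attrs': dict(attrs), 'p_count': 0}
        if tag == 'p' and self.stack:
            parent = self.stack[-1]
            parent['p_count'] += 1
            if self.best is None or parent['p_count'] > self.best['p_count']:
                self.best = parent
        if self.known_match is None and self._matches_known(node['attrs']):
            self.known_match = node
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]['tag'] == tag:
                del self.stack[i:]
                break


def find_top_node(html, known_tags):
    finder = TopNodeFinder(known_tags)
    finder.feed(html)
    finder.close()
    return finder.known_match or finder.best


# Made-up page: the article body is a div with id="docText" and no <p> tags.
page = (
    '<html><body>'
    '<div id="nav"><p>Home</p><p>About</p></div>'
    '<div id="docText">CHENNAI, Dec. 19 ...<br/><br/>more text</div>'
    '</body></html>'
)

default = find_top_node(page, DEFAULT_KNOWN)
patched = find_top_node(page, [{'attr': 'id', 'value': 'docText'}] + DEFAULT_KNOWN)
print(default['attrs'].get('id'))  # nav
print(patched['attrs'].get('id'))  # docText
```

With the default list, the toy lookup settles on the navigation div because it is the only element with p children. Once an entry for id 'docText' is placed at the front of the list, the article container is matched directly.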

What I can suggest is adding this entry at the beginning of the KNOWN_ARTICLE_CONTENT_TAGS constant in extractors/content.py:

KNOWN_ARTICLE_CONTENT_TAGS = [
    # New entry: match the div with id="docText" first.
    {'attr': 'id', 'value': 'docText'},
    # ... other paths go here
]

This is the extracted body text:

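CHENNAI, Dec. 19 -- The Tamil Nadu government on Monday appointed a one-man judicial commission of inquiry to investigate the causes of Sunday's stampede in the state capital, Chennai, which killed 42 people and injured another 37.\n\nThe announcement of the commission came even as relatives of the stampede victims were distraught and agitated over the sudden tragedy.\n\nForty-two homeless people were trampled to death while flood relief supplies were being distributed at a shelter in the Tamil Nadu capital.\n\nOfficials said more than 5,000 people rushed the gates of the shelter, triggering the stampede.\n\nChitra, a family member of one of the victims, said mismanagement led to the tragedy.\u2026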

