Problem description
Given an HTML page that is a text-heavy article, I would like to identify and parse out the primary content.
Using http://www.fivethirtyeight.com/2009/08/chavismo-obama-and-monroe-doctrine.html as an example, I want to identify div#post-4438372351887392855, which contains the title and the article.
I know nothing will be perfect or work 100% of the time, but is there an approach that gives the desired result in a reasonable share of cases?
My present thought is to iterate through each div, strip out the markup, and then find the innermost div that contains the most text.
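To make that idea concrete, here is a minimal sketch in Python using only the standard library. It rests on one assumption about what "innermost" means: each run of text is credited to its nearest enclosing div, so a wrapper div does not win just because it contains everything. The names `DivTextCounter` and `find_main_div` are made up for illustration, and text inside `<script>` and `<style>` is ignored. Treat it as a starting point, not a robust extractor.

```python
from html.parser import HTMLParser

# Text inside these tags is code or styling, not article content.
SKIP_TEXT_TAGS = {"script", "style"}


class DivTextCounter(HTMLParser):
    """Counts, for every div, the text whose nearest enclosing div is that div."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # indices (into self.divs) of currently open divs
        self.divs = []        # one dict per div: {"id": ..., "chars": ...}
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TEXT_TAGS:
            self.skip_depth += 1
        elif tag == "div":
            self.divs.append({"id": dict(attrs).get("id"), "chars": 0})
            self.div_stack.append(len(self.divs) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP_TEXT_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "div" and self.div_stack:
            self.div_stack.pop()

    def handle_data(self, data):
        # Credit the text to the innermost open div only.
        if self.skip_depth == 0 and self.div_stack:
            self.divs[self.div_stack[-1]]["chars"] += len(data.strip())


def find_main_div(html):
    """Return the div record with the most directly owned text, or None."""
    parser = DivTextCounter()
    parser.feed(html)
    parser.close()
    if not parser.divs:
        return None
    return max(parser.divs, key=lambda d: d["chars"])
```

Calling `find_main_div(page_source)` returns a dict with the winning div's `id` and character count. A known weakness of this heuristic is that a long comment thread or link-heavy sidebar can outscore the article, so some penalty for link text is probably worth adding.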
At this point I'm just getting started, so I'm looking for input I can put towards a conceptual approach. Or, if something is already out there, an open source library would be nice.
Thanks in advance for your insight.
Recommended answer
Some folks at arc90 have done a pretty impressive job of this with their Readability bookmarklet. It seems to do a pretty good job of finding the 'main' content, and it works perfectly on the page you list.
You can look through their well-commented JavaScript (linked to in the bookmarklet), but you might want to contact the developers for their ideas and for permission to use the code.