Question
I was wondering how Google Reader extracts news items from a web page.
Does anyone know how it works? Or how someone could build a similar system that extracts the same information from the HTML of a web page?
Obviously it is not relying on a standard (nor is it only reading RSS/ATOM), because Google Reader appears to read the content of a page regardless of what the markup looks like.
Recommended answer
Google Reader does not currently do any kind of content extraction from raw web pages. It used to have a "track changes to arbitrary pages" feature, but that was removed more than a year ago.
When given a URL that is not that of a feed, Google Reader fetches its contents. If the contents are HTML, it looks for an autodiscovery element of the form <link rel="alternate" type="application/atom+xml" href="feed.xml">. If one is found, it subscribes to that feed.
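For anyone who wants to build something similar, the autodiscovery step described above can be sketched roughly as follows. This is only an illustration using Python's standard library, not Google Reader's actual code. The names FeedLinkFinder and discover_feeds are made up for this example, and accepting application/rss+xml in addition to application/atom+xml is an assumption on my part.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds. The answer only mentions Atom;
# including RSS here is an assumption made for this sketch.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> feed elements."""

    def __init__(self):
        super().__init__()
        self.feed_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        attr = {name.lower(): (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        if "alternate" in rels and mime in FEED_TYPES and attr.get("href"):
            self.feed_hrefs.append(attr["href"])


def discover_feeds(html_text, page_url):
    """Return absolute feed URLs advertised by the given HTML page."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    # href may be relative (like "feed.xml"), so resolve it against the page URL.
    return [urljoin(page_url, href) for href in finder.feed_hrefs]


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<link rel="alternate" type="application/atom+xml" href="feed.xml">'
        '</head><body></body></html>'
    )
    print(discover_feeds(page, "http://example.com/blog/"))
```

Fetching the page itself (for example with urllib.request) is left out on purpose. If the function returns an empty list, the page does not advertise a feed this way and there is nothing to subscribe to.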