I was wondering how Google Reader extracts news items from a web page.
Do any of you know how it works, or how someone could build a similar system to extract the same information from the HTML of a web page?
Obviously it is not using a standard (nor is it only reading RSS/Atom), because Google Reader shows that it can read the content of a page regardless of what the markup looks like.
Google Reader does not currently do any kind of extraction of content from raw web pages. It used to have a "track changes to arbitrary pages" feature, but that was removed more than a year ago.
When given a URL that is not that of a feed, Google Reader fetches its contents. If the contents are HTML, it looks for an autodiscovery element of the form <link rel="alternate" type="application/atom+xml" href="feed.xml">. If one is found, it subscribes to that feed.
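
If you want to build something similar, the autodiscovery step described above could be sketched roughly like this in Python. This is only an illustration using the standard library, not Google Reader's actual code: the names FeedLinkFinder and discover_feeds are made up for the example, and accepting application/rss+xml alongside the Atom type is my assumption (the answer above only mentions the Atom form).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: treat RSS links like Atom links. Only the Atom type is
# mentioned in the answer above.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> feed elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (e.g. <link rel>), so normalise them.
        attributes = {name: (value or "") for name, value in attrs}
        rels = attributes.get("rel", "").lower().split()
        mime_type = attributes.get("type", "").strip().lower()
        href = attributes.get("href", "")
        if "alternate" in rels and mime_type in FEED_TYPES and href:
            self.hrefs.append(href)


def discover_feeds(html, page_url):
    """Return absolute feed URLs advertised by the page, in document order."""
    finder = FeedLinkFinder()
    finder.feed(html)
    # href may be relative (like "feed.xml"), so resolve it against the page.
    return [urljoin(page_url, href) for href in finder.hrefs]


if __name__ == "__main__":
    page = '<head><link rel="alternate" type="application/atom+xml" href="feed.xml"></head>'
    print(discover_feeds(page, "http://example.com/blog/"))
```

Fetching the page and subscribing to the result are left out on purpose. If the list comes back empty, the page does not advertise a feed, and this approach has nothing to subscribe to.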