如何抓取/索引经常更新的网页的策略?

本文介绍了如何抓取/索引经常更新的网页的策略?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试构建一个非常小的、小众的搜索引擎，使用 Nutch 来抓取特定的站点.一些网站是新闻/博客网站.如果我抓取，比如说，techcrunch.com，并存储和索引他们的首页或他们的任何主要页面，那么在几小时内我对该页面的索引就会过时.

像谷歌这样的大型搜索引擎是否有一种算法可以非常频繁地、甚至每小时重新抓取频繁更新的页面?或者它只是对经常更新的页面评分很低，因此它们不会被退回?

如何在我自己的应用程序中处理这个问题?

解决方案

好问题.这实际上是 WWW 研究社区中的一个活跃话题.所涉及的技术称为重新抓取策略或页面刷新策略.

据我所知，文献中考虑了三个不同的因素:

更改频率(网页内容更新的频率)
- [1]:形式化数据新鲜度"的概念，并使用泊松过程来模拟网页的变化.
- [2]:频率估计器
- [3]:更多的调度策略
相关性(更新的页面内容对搜索结果的影响有多大)
- [4]:最大限度地提高搜索引擎查询者的用户体验质量
- [5]:确定(几乎)最佳的爬行频率
信息寿命(随着时间的推移从网页中出现和消失的内容片段的寿命，显示与更改频率没有强相关性)
- [6]:区分临时和持久内容

您可能想要决定哪个因素对您的应用程序和用户更重要.然后您可以查看以下参考以了解更多详细信息.

编辑:为了帮助您入门，我简要讨论了 [2] 中提到的频率估计器.基于此，您应该能够找出其他论文中可能对您有用的内容.:)

请按照我在下面指出的顺序阅读本文.只要你知道一些概率和统计数据 101 应该不会太难理解(如果你只采用估算器公式可能会少得多):

第 1 步.请转到部分 6.4 -- 应用到网络爬虫.Cho 在此列出了 3 种估算网页更改频率的方法.

统一政策:抓取工具以每周一次的频率重新访问所有页面.
天真的策略:在前 5 次访问中，爬虫以该频率访问每个页面每周一次.5次访问后，爬虫估计变化频率使用 naive estimator 的页面(第 4.1 节)
我们的政策:抓取工具使用建议的估算器(第 4.2 节)来估算更改频率.

第 2 步.天真的政策.请转到第 4 部分.您将阅读:

直观地，我们可以使用X/T(X:检测到的变化次数，T:监测周期)作为估计频率的变化.

子序列第 4.1 节刚刚证明了这种估计是有偏的, in-consistant and in-efficient.

Step 3. The improved estimator. Please go to section 4.2. The new estimator looks like below:

where \bar X is n - X (the number of accesses that the element did not change) and n is the number of accesses. So just take this formula and estimate the change frequency. You don't need to understand the proof in the rest of the sub-section.

Step 4. There are some tricks and useful techniques discussed in Section 4.3 and Section 5 that might be helpful to you. Section 4.3 discussed how to deal with irregular intervals. Section 5 solved the question: When the last-modication date of an element is available, how can we use it to estimate change frequency? The proposed estimator using last-modification date is shown below:

The explanation to the above algorithm after Fig.10 in the paper is very clear.

Step 5. Now if you have interest, you can take a look at the experiment setup and results in section 6.

So that's it. If you feel more confident now, go ahead and try the freshness paper in [1].

References

[1] http://oak.cs.ucla.edu/~cho/papers/cho-tods03.pdf

[2] http://oak.cs.ucla.edu/~cho/papers/cho-freq.pdf

[3] http://hal.inria.fr/docs/00/07/33/72/PDF/RR-3317.pdf

[4] http://wwwconference.org/proceedings/www2005/docs/p401.pdf

[5] http://www.columbia.edu/~js1353/pubs/wolf-www02.pdf

[6] http://infolab.stanford.edu/~olston/publications/www08.pdf

这篇关于如何抓取/索引经常更新的网页的策略?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！