A quote from the Google blog:
"In fact, we found even more than 1 trillion individual links, but not all of
them lead to unique web pages. Many pages have multiple URLs with exactly the same
content or URLs that are auto-generated copies of each other. Even after removing
those exact duplicates . . . "
How does Google detect those exact-duplicate web pages or documents? Any ideas on the algorithm Google uses?
Best answer
According to http://en.wikipedia.org/wiki/MinHash :
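MinHash estimates the Jaccard similarity of two documents' shingle sets by comparing, position by position, the minimum hash values produced by many independent hash functions. A minimal sketch of the idea (the 3-word shingles, MD5-based seeded hashing, and 64-entry signature here are illustrative assumptions, not Google's actual parameters):

```python
import hashlib

def shingles(text, k=3):
    """Break a document into its set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep only the
    minimum hash value over the document's shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of signature positions that agree is an unbiased
    estimate of the Jaccard similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate pages share most of their shingles, so their signatures agree in most positions; exact duplicates produce identical signatures, which makes deduplication a matter of comparing short fixed-size signatures rather than full page contents.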
Searching for Simhash turns up these pages:
https://liangsun.org/posts/a-python-implementation-of-simhash-algorithm/
https://github.com/leonsim/simhash
which reference a paper written by Google employees: Detecting near-duplicates for web crawling
Another Simhash paper:
http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf
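The paper above fingerprints each page with a 64-bit simhash and treats two pages as near-duplicates when the fingerprints differ in only a few bit positions (the paper uses Hamming distance at most 3 on 64-bit fingerprints). A minimal sketch of the fingerprinting step, assuming unweighted word features (a production system would weight features, e.g. by term frequency):

```python
import hashlib

def simhash(text, bits=64):
    """Each token casts a +1/-1 vote per bit position according to its
    own hash; the sign of the final tally becomes that fingerprint bit."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Because each bit is a majority vote over all tokens, changing a small part of a page flips only a few bits, so near-duplicates land within a small Hamming distance of each other while unrelated pages differ in roughly half the bits.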
Regarding "algorithm - Detecting duplicate web pages among a large number of URLs", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18615748/