A quote from the Google blog:
"In fact, we found even more than 1 trillion individual links, but not all of
them lead to unique web pages. Many pages have multiple URLs with exactly the same
content or URLs that are auto-generated copies of each other. Even after removing
those exact duplicates . . . "
How does Google detect those exact-duplicate web pages or documents? Any ideas on the algorithm Google uses?
Best answer
According to http://en.wikipedia.org/wiki/MinHash :
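MinHash estimates the Jaccard similarity of two documents' shingle sets by comparing, position by position, the minimum hash values produced by many independent hash functions. A minimal sketch of the idea (the 3-word shingles, MD5-based seeded hashing, and 64-entry signature here are illustrative assumptions, not Google's actual parameters):

```python
import hashlib

def shingles(text, k=3):
    """Break a document into its set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep only the
    minimum hash value over the document's shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of signature positions that agree is an unbiased
    estimate of the Jaccard similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate pages share most of their shingles, so their signatures agree in most positions; exact duplicates produce identical signatures, which makes deduplication a matter of comparing short fixed-size signatures rather than full page contents.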
Searching for Simhash turns up these pages:
https://liangsun.org/posts/a-python-implementation-of-simhash-algorithm/
https://github.com/leonsim/simhash
which reference a paper written by Google employees: Detecting near-duplicates for web crawling
Another Simhash paper:
http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf
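The paper above fingerprints each page with a 64-bit simhash and treats two pages as near-duplicates when the fingerprints differ in only a few bit positions (the paper uses Hamming distance at most 3 on 64-bit fingerprints). A minimal sketch of the fingerprinting step, assuming unweighted word features (a production system would weight features, e.g. by term frequency):

```python
import hashlib

def simhash(text, bits=64):
    """Each token casts a +1/-1 vote per bit position according to its
    own hash; the sign of the final tally becomes that fingerprint bit."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Because each bit is a majority vote over all tokens, changing a small part of a page flips only a few bits, so near-duplicates land within a small Hamming distance of each other while unrelated pages differ in roughly half the bits.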
Regarding "algorithm - Detecting duplicate web pages among a large number of URLs", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18615748/