问题描述
我熟悉SimHash和MinHash的LSH(局部敏感哈希)技术. SimHash对实际值数据使用余弦相似度. MinHash计算二进制矢量上的相似度相似度.但是我无法决定哪个更适合使用.
I'm familiar with the LSH (Locality Sensitive Hashing) techniques of SimHash and MinHash. SimHash uses cosine similarity over real-valued data. MinHash calculates resemblance similarity over binary vectors. But I can't decide which one would be better to use.
我正在为网站创建一个后端系统,以查找几乎重复的半结构化文本数据.例如,每条记录将具有标题,位置和简短的文本描述(少于500个字).
I am creating a backend system for a website to find near duplicates of semi-structured text data. For example, each record will have a title, location, and a brief text description (<500 words).
除了特定的语言实现之外,哪种算法最适合未开发的生产系统?
Specific language implementation aside, which algorithm would be best for a greenfield production system?
推荐答案
Simhash 更快(非常快)并且通常需要较少的存储空间,但是对两个文档之间可以有不同的方式施加了严格的限制,并且仍被检测为重复项.如果您使用的是64位simhash(通常的选择),并且取决于您能够存储多少个置换表,则可能会将汉明距离限制为低至3或高至6或7.小汉明距离!您将只能检测几乎相同的文档,即使那样,您可能仍需要对选择加入simhash的哪些功能以及赋予它们的权重进行一些仔细的调整.
Simhash is faster (very fast) and typically requires less storage, but imposes a strict limitation on how dissimilar two documents can be and still be detected as duplicates. If you are using a 64-bit simhash (a common choice), and depending on how many permuted tables you are capable of storing, you might be limited to hamming distances of as low as 3 or possibly as high as 6 or 7. Those are small hamming distances! You'll be limited to detecting documents that are mostly identical, and even then you may need to do some careful tuning of what features you choose to go into the simhash and what weightings you give to them.
类似物的产生已获得Google的专利保护,尽管在实践中它们似乎至少允许非商业用途.
The generation of simhashes is patented by google, though in practice they seem to allow at least non-commercial use.
Minhash 使用更多的内存,因为通常每个文档存储50-400个散列,并且CPU效率不如simhash,但它可以让您找到相距遥远的相似之处,例如如果需要,估计的相似性低至5%.比simhash还容易理解,特别是在表的工作方式方面.它的实现非常简单,通常使用混叠,不需要太多的调整即可获得良好的结果. (据我所知)它没有专利.
Minhash uses more memory, since you'd be typically storing 50-400 hashes per document, and it's not as CPU-efficient as simhash, but it allows you to find quite distant similarities, e.g. as low as 5% estimated similarity, if you want. It's also a bit easier to understand than simhash, particularly in terms of how the tables work. It's quite straightforward to implement, typically using shingling, and doesn't need a lot of tuning to get good results. It's not (to my knowledge) patented.
如果您要处理大数据,则minhash方法中占用CPU最多的部分可能是 之后,当您在搜索文档时产生了哈希值表格以查找共享其某些哈希值的其他文档.可能有成千上万的文档与其共享至少一个散列,并且您必须仔细研究所有这些散列,以找到那些共享少量哈希的文档,例如至少一半的哈希值. Simhash在这里要快得多.
If you're dealing with big data, the most CPU-intensive part of the minhash approach will likely be after you've generated the minhashes for your document, when you're hunting through your table to find other documents that share some of its hashes. There may be tens or hundreds of thousands of documents that share at least one hash with it, and you've got to weed through all of these to find those few that share e.g. a minimum of half its hashes. Simhash is a lot quicker here.
正如Otmar在下面的评论中指出的那样,minhash进行了优化,可以使您的相似度估计达到相同的精度,并且每个文档的哈希数更少.这样可以大大减少您必须除草的数量.
As Otmar points out in his comment below, there are optimizations of minhash that allow you to achieve the same precision on your similarity estimates with fewer hashes per document. This can substantially reduce the amount of weeding you have to do.
我现在尝试了 superminhash .尽管我使用单个方法实现了minhash ,但速度相当快哈希函数加上位转换以生成所有其他哈希对于我来说是更快的.它提供了更准确的jaccard估算值,在我测试过的某些情况下大约提高了15%(尽管在其他情况下几乎没有区别).这应该意味着您需要减少大约三分之一的哈希值才能达到 same 的准确性.在表中存储较少的哈希值意味着需要较少的除草"来识别几乎重复的数据,从而显着提高了速度.我不知道有关superminhash的任何专利.谢谢奥特玛!
I have now tried superminhash. It's fairly fast, though my implementation of minhash using a single hash function plus bit-transformations to produce all the other hashes was faster for my purposes. It offers more accurate jaccard estimates, about 15% better under some situations I tested (though almost no difference under others). This should mean you need about a third fewer hashes to achieve the same accuracy. Storing fewer hashes in your table means less "weeding" is needed to identify near duplicates, which delivers a significant speed-up. I'm not aware of any patent on superminhash. Thanks Otmar!
这篇关于在生产系统的SimHash和MinHash之间选择的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!