I'm looking for an algorithm that can generate a short hash code/digest (e.g. 16 characters; the exact length is not important) from a longer string.
The main requirement is that strings which are almost identical should result in the same digest.
For example, two almost identical mails:
Hi Martin. Here are some ... spam for you. Regards XYZ. => AAAA AAAA AAAA AAAA
Hi Bo. Here are some ... spam for you. Regards EFG. => AAAA AAAA AAAA AAAA
return the same digest (or almost the same), whereas a different mail:
Hello Finn. This is a test mail. => CCCC CCCC CCCC CCCC
This algorithm would be part of a spam filter. The filter will remember digests from mails that it is certain are spam. If the same digest shows up in a mail it is in doubt about, the matching digest will cause the filter to increase the spam score.
I know about Levenshtein distance, but it requires me to know the strings up front. In this situation I do not have this information. I could have it, but that would require the filter to store all spam e-mails and check against each one, which would be a very slow process.
Maybe some lossy compression algorithm coupled with a calculation of the Levenshtein distance between the two could work.
Any pointers appreciated.
It looks like you want locality-sensitive hashing. Consider using minhash together with shingling. There's a great explanation of both in Rajaraman & Ullman's book, Mining of Massive Datasets. You'll find numerous short implementations in Python by searching blogs for the keywords above.
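To make the idea concrete, here is a minimal sketch of word shingling plus minhash, using only the standard library. The function names, the shingle size `k=3`, the 64 hash functions, and the use of seeded `blake2b` as the hash family are all assumptions chosen for illustration, not something taken from the book. Note that it produces signatures that are similar for similar mails rather than one identical digest; a real filter would typically split the signature into bands and use each band as a lookup key.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short to form a full shingle: treat the whole text as one shingle.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, shingle):
    """One member of the (assumed) hash family: blake2b over 'seed:shingle'."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64, k=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text, k)
    return [min(_seeded_hash(seed, s) for s in shingle_set)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions; this estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    spam_a = "Hi Martin. Here are some ... spam for you. Regards XYZ."
    spam_b = "Hi Bo. Here are some ... spam for you. Regards EFG."
    other = "Hello Finn. This is a test mail."

    sig_a, sig_b, sig_c = (minhash_signature(m) for m in (spam_a, spam_b, other))
    print("spam vs spam :", estimated_similarity(sig_a, sig_b))
    print("spam vs other:", estimated_similarity(sig_a, sig_c))
```

The two spam mails share most of their shingles, so their signatures agree in a sizeable fraction of positions, while the test mail shares none and should score zero. The filter only has to store the signatures of known spam, not the mails themselves.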
There seem to be other approaches to this (that I don't know much about) that may be of interest to you, since they are specially tailored for spam messages, in particular the nilsimsa hash:
- explained in that paper
- which has a python port on pypi