I'm aware of SOUNDEX and (double) Metaphone, but these don't let me test for the similarity of words as a whole - for example "Hi" sounds very similar to "Bye", but both of these methods will mark them as completely different.

Are there any libraries in Ruby, or any methods you know of, that are capable of determining the similarity between two words? (Either a boolean is/isn't similar, or numerical 40% similar)


edit: Extra bonus points if there is an easy method to 'drop in' a different dialect or language!


I think you're describing Levenshtein distance. And yes, there are gems for that. If you're into pure Ruby, go for the text gem.

$ gem install text

The docs have more details, but here's the crux of it:

Text::Levenshtein.distance('test', 'test')    # => 0
Text::Levenshtein.distance('test', 'tent')    # => 1


If you're ok with native extensions...

$ gem install levenshtein

Its usage is similar. Its performance is very good. (It handles ~1000 spelling corrections per minute on my systems.)


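If you need to know how similar two words are, use distance over word length: dividing the distance by the length of the longer word, for example, gives a ratio where 0 means identical and 1 means completely different, so 1 minus that ratio reads as a "40% similar" style score.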
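
To make the distance-over-length idea concrete, here's a rough sketch. The `levenshtein` and `similarity` helpers are names I made up for illustration, not part of either gem; I've written the distance in plain Ruby so it runs without installing anything, and dividing by the longer word's length is just one reasonable choice of normalization.

```ruby
# Hypothetical helper, not part of the text or levenshtein gems.
# Plain dynamic-programming Levenshtein distance, one row at a time.
def levenshtein(a, b)
  a = a.chars
  b = b.chars
  prev = (0..b.length).to_a            # distances from the empty prefix of a
  a.each_with_index do |ca, i|
    curr = [i + 1]
    b.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j + 1] + 1,        # deletion
               curr[j] + 1,            # insertion
               prev[j] + cost].min     # substitution or match
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: similarity in 0.0..1.0, computed as
# 1 minus (distance / length of the longer word).
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?          # two empty strings are identical
  1.0 - levenshtein(a, b).to_f / longest
end

similarity('test', 'tent')   # => 0.75
```

If you have the text gem installed you could presumably swap `levenshtein(a, b)` for `Text::Levenshtein.distance(a, b)` and keep the rest as is.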


If you want a simple similarity test, consider something like this:


require 'text'

String.module_eval do
  # True when the edit distance to `other` is at most `threshold`.
  def similar?(other, threshold = 2)
    distance = Text::Levenshtein.distance(self, other)
    distance <= threshold
  end
end
