Problem description
I want to use string similarity functions to find corrupted data in my database.
I came across several of them:
- Jaro,
- Jaro-Winkler,
- Levenshtein,
- Euclidean, and
- Q-gram
I wanted to know what the difference between them is and in which situations each works best.
Recommended answer
Expanding on my wiki-walk comment in the errata and noting some of the ground-floor literature on the comparability of algorithms that apply to similar problem spaces, let's explore the applicability of these algorithms before we determine if they're numerically comparable.
From Wikipedia, Levenshtein distance:
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
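To make that definition concrete, here is a minimal dynamic-programming sketch in Python (my own illustration, not part of the quoted answer; the function name and test strings are arbitrary):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```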
And on n-gram models, also from Wikipedia:
The two core advantages of n-gram models (and algorithms that use them) are relative simplicity and the ability to scale up – by simply increasing n a model can be used to store more context with a well-understood space–time tradeoff, enabling small experiments to scale up very efficiently.
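As a rough sketch of how q-grams can be turned into a similarity score (the '#' padding and the Jaccard scoring below are my assumptions; other q-gram metrics exist), consider:

```python
def qgrams(s: str, q: int = 2) -> set:
    """Overlapping character q-grams of s, padded with '#' so that
    string boundaries also contribute grams."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard similarity of the two q-gram sets: 1.0 for identical
    gram sets, 0.0 for completely disjoint ones."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


print(qgram_similarity("nelson", "neilsen"))  # ~0.36
```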
The trouble is these algorithms solve different problems that have different applicability within the space of all possible algorithms to solve the longest common subsequence problem, in your data or in grafting a usable metric thereof. In fact, not all of these are even metrics, as some of them don't satisfy the triangle inequality.
Instead of going out of your way to define a dubious scheme to detect data corruption, do this properly: by using checksums and parity bits for your data. Don't try to solve a much harder problem when a simpler solution will do.
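For instance, using only Python's standard library (hashlib and zlib; the record contents are made up for illustration), corruption detection with a stored checksum looks roughly like this:

```python
import hashlib
import zlib

# Hypothetical record; in practice this would be a row or page serialized to bytes.
record = b"customer_id=42;name=Nelson;balance=100.00"

# Compute and store checksums when the data is written...
crc = zlib.crc32(record)
digest = hashlib.sha256(record).hexdigest()

# ...then recompute on read: any corrupted byte changes the checksum.
corrupted = b"customer_id=42;name=Nelsen;balance=100.00"
print(zlib.crc32(corrupted) == crc)                     # False -> corruption detected
print(hashlib.sha256(corrupted).hexdigest() == digest)  # False -> corruption detected
```

A CRC is cheap and catches accidental bit flips; a cryptographic hash such as SHA-256 additionally resists deliberate tampering.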