问题描述
我有一个表人员,其中包含个人数据等.有很多列,但这里感兴趣的是:addressindex
、lastname
和 firstname
其中 addressindex
是深入到公寓门口的唯一地址.因此,如果我有像下面这样"两个 lastname
和一个 firstnames
的人,他们很可能是重复的.
I have a table Persons with personaldata and so on. There are lots of columns but the once of interest here are: addressindex
, lastname
and firstname
where addressindex
is a unique address drilled down to the door of the apartment.So if I have 'like below' two persons with the lastname
and one the firstnames
are the same they are most likely duplicates.
我需要一种方法来列出这些重复项.
I need a way to list these duplicates.
tabledata:
personid 1
firstname "Carl"
lastname "Anderson"
addressindex 1
personid 2
firstname "Carl Peter"
lastname "Anderson"
addressindex 1
如果我要在所有列上完全匹配,我知道该怎么做,但我需要模糊匹配来使用(来自上面的示例)结果如下:
I know how do this if I were to match exactly on all columns but I need fuzzy match to do the trick with (from the above example) a result like:
Row personid addressindex lastname firstname
1 2 1 Anderson Carl Peter
2 1 1 Anderson Carl
.....
有关如何以好的方式解决此问题的任何提示?
Any hints on how to solve this in a good way?
推荐答案
我发现 SQL Server 提供给您进行模糊匹配的东西非常笨拙.我使用 Levenshtein 距离算法和一些加权在我自己的 CLR 函数上真的很幸运.使用该算法,我创建了一个名为 GetSimilarityScore 的 UDF,它接受两个字符串并返回一个介于 0.0 和 1.0 之间的分数.匹配度越接近 1.0 越好.然后,使用 >=0.8 左右的阈值进行查询以获得最可能的匹配项.像这样:
I've found that the stuff SQL Server gives you to do fuzzy matching is pretty clunky. I've had really good luck with my own CLR functions using the Levenshtein distance algorithm and some weighting. Using that algorithm, I've then made a UDF called GetSimilarityScore that takes two strings and returns a score between 0.0 and 1.0. The closer to 1.0 the match is, the better. Then, query with a threshold of >=0.8 or so to get the most likely matches. Something like this:
if object_id('tempdb..#similar') is not null drop table #similar
select a.id, (
select top 1 x.id
from MyTable x
where x.id <> a.id
order by dbo.GetSimilarityScore(a.MyField, x.MyField) desc
) as MostSimilarId
into #similar
from MyTable a
select *, dbo.GetSimilarityScore(a.MyField, c.MyField)
from MyTable a
join #similar b on a.id = b.id
join MyTable c on b.MostSimilarId = c.id
只是不要用非常大的桌子来做.这是一个缓慢的过程.
Just don't do it with really large tables. It's a slow process.
这是 CLR UDF:
Here's the CLR UDFs:
''' <summary>
''' Compute the distance between two strings.
''' </summary>
''' <param name="s1">The first of the two strings.</param>
''' <param name="s2">The second of the two strings.</param>
''' <returns>The Levenshtein cost.</returns>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function ComputeLevenstheinDistance(ByVal string1 As SqlString, ByVal string2 As SqlString) As SqlInt32
If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
Dim s1 As String = string1.Value
Dim s2 As String = string2.Value
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d As Integer(,) = New Integer(n, m) {}
' Step 1
If n = 0 Then Return m
If m = 0 Then Return n
' Step 2
For i As Integer = 0 To n
d(i, 0) = i
Next
For j As Integer = 0 To m
d(0, j) = j
Next
' Step 3
For i As Integer = 1 To n
'Step 4
For j As Integer = 1 To m
' Step 5
Dim cost As Integer = If((s2(j - 1) = s1(i - 1)), 0, 1)
' Step 6
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
' Step 7
Return d(n, m)
End Function
''' <summary>
''' Returns a score between 0.0-1.0 indicating how closely two strings match. 1.0 is a 100%
''' T-SQL equality match, and the score goes down from there towards 0.0 for less similar strings.
''' </summary>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function GetSimilarityScore(string1 As SqlString, string2 As SqlString) As SqlDouble
If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
Dim s1 As String = string1.Value.ToUpper().TrimEnd(" "c)
Dim s2 As String = string2.Value.ToUpper().TrimEnd(" "c)
If s1 = s2 Then Return 1.0F ' At this point, T-SQL would consider them the same, so I will too
Dim flatLevScore As Double = InternalGetSimilarityScore(s1, s2)
Dim letterS1 As String = GetLetterSimilarityString(s1)
Dim letterS2 As String = GetLetterSimilarityString(s2)
Dim letterScore As Double = InternalGetSimilarityScore(letterS1, letterS2)
'Dim wordS1 As String = GetWordSimilarityString(s1)
'Dim wordS2 As String = GetWordSimilarityString(s2)
'Dim wordScore As Double = InternalGetSimilarityScore(wordS1, wordS2)
If flatLevScore = 1.0F AndAlso letterScore = 1.0F Then Return 1.0F
If flatLevScore = 0.0F AndAlso letterScore = 0.0F Then Return 0.0F
' Return weighted result
Return (flatLevScore * 0.2F) + (letterScore * 0.8F)
End Function
Private Shared Function InternalGetSimilarityScore(s1 As String, s2 As String) As Double
Dim dist As SqlInt32 = ComputeLevenstheinDistance(s1, s2)
Dim maxLen As Integer = If(s1.Length > s2.Length, s1.Length, s2.Length)
If maxLen = 0 Then Return 1.0F
Return 1.0F - Convert.ToDouble(dist.Value) / Convert.ToDouble(maxLen)
End Function
''' <summary>
''' Sorts all the alpha numeric characters in the string in alphabetical order
''' and removes everything else.
''' </summary>
Private Shared Function GetLetterSimilarityString(s1 As String) As String
Dim allChars = If(s1, "").ToUpper().ToCharArray()
Array.Sort(allChars)
Dim result As New StringBuilder()
For Each ch As Char In allChars
If Char.IsLetterOrDigit(ch) Then
result.Append(ch)
End If
Next
Return result.ToString()
End Function
''' <summary>
''' Removes all non-alpha numeric characters and then sorts
''' the words in alphabetical order.
''' </summary>
Private Shared Function GetWordSimilarityString(s1 As String) As String
Dim words As New List(Of String)()
Dim curWord As StringBuilder = Nothing
For Each ch As Char In If(s1, "").ToUpper()
If Char.IsLetterOrDigit(ch) Then
If curWord Is Nothing Then
curWord = New StringBuilder()
End If
curWord.Append(ch)
Else
If curWord IsNot Nothing Then
words.Add(curWord.ToString())
curWord = Nothing
End If
End If
Next
If curWord IsNot Nothing Then
words.Add(curWord.ToString())
End If
words.Sort(StringComparer.OrdinalIgnoreCase)
Return String.Join(" ", words.ToArray())
End Function
这篇关于使用 T-SQL 进行模糊匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!