问题描述
我有两组数据.现有客户和潜在客户.
I have two sets of data. Existing customers and potential customers.
我的主要目标是弄清楚是否有任何潜在客户已经是现有客户.但是,跨数据集的客户命名约定不一致.
My main objective is to figure out if any of the potential customers are already existing customers. However, the naming conventions of customers across data sets are inconsistent.
现有客户
Customer / ID
Ed's Barbershop / 1002
GroceryTown / 1003
Candy Place / 1004
Handy Man / 1005
潜在客户
Customer
Eds Barbershop
Grocery Town
Candy Place
Handee Man
Beauty Salon
The Apple Farm
Igloo Ice Cream
Ride-a-Long Bikes
我想编写如下类型的select语句来达到我的目标:
I would like to write some type of select statement like below to reach my objective:
SELECT a.Customer, b.ID
FROM PotentialCustomers a LEFT JOIN
ExistingCustomers B
ON a.Customer = b.Customer
结果类似于:
Customer / ID
Eds Barbershop / 1002
Grocery Town / 1003
Candy Place / 1004
Handee Man / 1005
Beauty Salon / NULL
The Apple Farm / NULL
Igloo Ice Cream / NULL
Ride-a-Long Bikes / NULL
我对Levenshtein距离和Double Metaphone的概念含糊其词,但是我不确定如何在此处应用它.
I am vaguely familiar with the concepts of Levenshtein Distance and Double Metaphone but I am not sure how to apply it here.
理想情况下,我希望SELECT语句的JOIN部分读取类似以下内容的内容:LEFT JOIN ExistingCustomers as B WHERE a.Customer LIKE b.Customer
,但我知道语法不正确.
Ideally I would want the JOIN portion of the SELECT statement to read something like: LEFT JOIN ExistingCustomers as B WHERE a.Customer LIKE b.Customer
but I know that syntax is incorrect.
欢迎任何建议.谢谢!
推荐答案
以下是使用Levenshtein Distance的方法:
Here is how this could be done using Levenshtein Distance:
创建此函数:(先执行此操作)
Create this function:(Execute this first)
CREATE FUNCTION ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int, @s2_len int
DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)
SELECT
@s1_len = LEN(@s1),
@s2_len = LEN(@s2),
@cv1 = 0x0000,
@j = 1, @i = 1, @c = 0
WHILE @j <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
WHILE @i <= @s1_len
BEGIN
SELECT
@s1_char = SUBSTRING(@s1, @i, 1),
@c = @i,
@cv0 = CAST(@i AS binary(2)),
@j = 1
WHILE @j <= @s2_len
BEGIN
SET @c = @c + 1
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
IF @c > @c_temp SET @c = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
IF @c > @c_temp SET @c = @c_temp
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
END
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN @c
END
(函数由Joseph Gama开发)
(Function developped by Joseph Gama)
然后只需使用此查询即可获取匹配项
And then simply use this query to get matches
SELECT A.Customer,
b.ID,
b.Customer
FROM #POTENTIALCUSTOMERS a
LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 5;
创建该函数后,请完成脚本:
Complete Script after you create that function:
IF OBJECT_ID('tempdb..#ExistingCustomers') IS NOT NULL
DROP TABLE #ExistingCustomers;
CREATE TABLE #ExistingCustomers
(Customer VARCHAR(255),
ID INT
);
INSERT INTO #ExistingCustomers
VALUES
('Ed''s Barbershop',
1002
);
INSERT INTO #ExistingCustomers
VALUES
('GroceryTown',
1003
);
INSERT INTO #ExistingCustomers
VALUES
('Candy Place',
1004
);
INSERT INTO #ExistingCustomers
VALUES
('Handy Man',
1005
);
IF OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') IS NOT NULL
DROP TABLE #POTENTIALCUSTOMERS;
CREATE TABLE #POTENTIALCUSTOMERS(Customer VARCHAR(255));
INSERT INTO #POTENTIALCUSTOMERS
VALUES('Eds Barbershop');
INSERT INTO #POTENTIALCUSTOMERS
VALUES('Grocery Town');
INSERT INTO #POTENTIALCUSTOMERS
VALUES('Candy Place');
INSERT INTO #POTENTIALCUSTOMERS
VALUES('Handee Man');
INSERT INTO #POTENTIALCUSTOMERS
VALUES('Beauty Salon');
INSERT INTO #POTENTIALCUSTOMERS
VALUES('The Apple Farm');
INSERT INTO #POTENTIALCUSTOMERS
VALUES('Igloo Ice Cream');
INSERT INTO #POTENTIALCUSTOMERS
VALUES('Ride-a-Long Bikes');
SELECT A.Customer,
b.ID,
b.Customer
FROM #POTENTIALCUSTOMERS a
LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 5;
在这里您可以在 http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx
这篇关于SQL模糊联接-MSSQL的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!