本文介绍了SQL模糊联接-MSSQL的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两组数据.现有客户和潜在客户.

I have two sets of data. Existing customers and potential customers.

我的主要目标是弄清楚是否有任何潜在客户已经是现有客户.但是,跨数据集的客户命名约定不一致.

My main objective is to figure out if any of the potential customers are already existing customers. However, the naming conventions of customers across data sets are inconsistent.

现有客户

Customer /  ID
Ed's Barbershop /   1002
GroceryTown /   1003
Candy Place /   1004
Handy Man / 1005

潜在客户

Customer
Eds Barbershop
Grocery Town
Candy Place
Handee Man
Beauty Salon
The Apple Farm
Igloo Ice Cream
Ride-a-Long Bikes

我想编写如下类型的select语句来达到我的目标:

I would like to write some type of select statement like below to reach my objective:

SELECT a.Customer, b.ID
FROM PotentialCustomers a LEFT JOIN
     ExistingCustomers B
     ON a.Customer = b.Customer

结果类似于:

Customer /  ID
Eds Barbershop  / 1002
Grocery Town    / 1003
Candy Place / 1004
Handee Man  / 1005
Beauty Salon /  NULL
The Apple Farm /    NULL
Igloo Ice Cream / NULL
Ride-a-Long Bikes / NULL

我对Levenshtein距离和Double Metaphone的概念含糊其词,但是我不确定如何在此处应用它.

I am vaguely familiar with the concepts of Levenshtein Distance and Double Metaphone but I am not sure how to apply it here.

理想情况下,我希望SELECT语句的JOIN部分读取类似以下内容的内容:LEFT JOIN ExistingCustomers as B WHERE a.Customer LIKE b.Customer,但我知道语法不正确.

Ideally I would want the JOIN portion of the SELECT statement to read something like: LEFT JOIN ExistingCustomers as B WHERE a.Customer LIKE b.Customer but I know that syntax is incorrect.

欢迎任何建议.谢谢!

推荐答案

以下是使用Levenshtein Distance的方法:

Here is how this could be done using Levenshtein Distance:

创建此函数:(先执行此操作)

Create this function:(Execute this first)

CREATE FUNCTION ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
 DECLARE @s1_len int, @s2_len int
 DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
 DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)

 SELECT
  @s1_len = LEN(@s1),
  @s2_len = LEN(@s2),
  @cv1 = 0x0000,
  @j = 1, @i = 1, @c = 0

 WHILE @j <= @s2_len
  SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1

 WHILE @i <= @s1_len
 BEGIN
  SELECT
   @s1_char = SUBSTRING(@s1, @i, 1),
   @c = @i,
   @cv0 = CAST(@i AS binary(2)),
   @j = 1

  WHILE @j <= @s2_len
  BEGIN
   SET @c = @c + 1
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
    CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
   IF @c > @c_temp SET @c = @c_temp
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
   IF @c > @c_temp SET @c = @c_temp
   SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
 END

 SELECT @cv1 = @cv0, @i = @i + 1
 END

 RETURN @c
END

(函数由Joseph Gama开发)

(Function developped by Joseph Gama)

然后只需使用此查询即可获取匹配项

And then simply use this query to get matches

SELECT A.Customer,
       b.ID,
       b.Customer
FROM #POTENTIALCUSTOMERS a
     LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 5;

创建该函数后,请完成脚本:

Complete Script after you create that function:

IF OBJECT_ID('tempdb..#ExistingCustomers') IS NOT NULL
    DROP TABLE #ExistingCustomers;

CREATE TABLE #ExistingCustomers
(Customer VARCHAR(255),
 ID       INT
);

INSERT INTO #ExistingCustomers
VALUES
('Ed''s Barbershop',
 1002
);

INSERT INTO #ExistingCustomers
VALUES
('GroceryTown',
 1003
);

INSERT INTO #ExistingCustomers
VALUES
('Candy Place',
 1004
);

INSERT INTO #ExistingCustomers
VALUES
('Handy Man',
 1005
);

IF OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') IS NOT NULL
    DROP TABLE #POTENTIALCUSTOMERS;

CREATE TABLE #POTENTIALCUSTOMERS(Customer VARCHAR(255));

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Eds Barbershop');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Grocery Town');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Candy Place');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Handee Man');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Beauty Salon');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('The Apple Farm');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Igloo Ice Cream');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Ride-a-Long Bikes');

SELECT A.Customer,
       b.ID,
       b.Customer
FROM #POTENTIALCUSTOMERS a
     LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 5;

在这里您可以在 http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx

这篇关于SQL模糊联接-MSSQL的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-12 21:48