Question
Is there an implementation of the Jaro-Winkler string comparison in SAS?
It looks like Link King has Jaro-Winkler, but I'd prefer the flexibility of calling the function myself.
Thanks!
Recommended answer
There is no built-in function for Jaro-Winkler distance that I am aware of. @Itzy already referenced the only ones that I know of. You can roll your own functions with proc fcmp, though, if you feel up to it. I'll even give you a head start with the code below. I just tried to follow the Wikipedia article on it. It certainly isn't close to being a perfect representation of Bill Winkler's strcmp.c file by any means and likely has lots of bugs.
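For reference, these are the quantities the code is aiming at, written out as the Wikipedia article on Jaro-Winkler gives them. Here m is the number of matching characters, t is half the number of matched characters that appear in a different order, ℓ is the length of the common prefix (capped at 4), and p is the prefix scaling factor (0.1 is the usual choice, and it should not exceed 0.25). Two equal characters count as matching only if their positions differ by at most floor(max(|s1|, |s2|)/2) - 1. The symbol names are just notation for this note; in the code, m roughly corresponds to m1_len and m2_len, and p to prefixscale.

```latex
\[
\mathrm{sim}_j =
\begin{cases}
0 & \text{if } m = 0,\\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,}
\end{cases}
\qquad
\mathrm{sim}_w = \mathrm{sim}_j + \ell \, p \, (1 - \mathrm{sim}_j)
\]
```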
proc fcmp outlib=work.jaro.chars;
subroutine jaromatch(string1 $, string2 $, matchChars $);
  outargs matchChars;
  /* Returns, in matchChars, the characters of string1 that have a match
     in string2 (blanks excluded). Two characters are considered matching
     if they are equal and no farther apart than
     floor(max(|s1|, |s2|)/2) - 1 positions. */
  str1_len = length(strip(string1));
  str2_len = length(strip(string2));
  allowedDist = floor(max(str1_len, str2_len)/2) - 1;
  matchChars = "";
  /* walk through string1 and match characters to string2 */
  do i = 1 to str1_len;
    x = substr(string1, i, 1);
    /* search string2 starting at the left edge of the match window */
    position = findc(string2, x, max(1, i - allowedDist));
    if position > 0 then do;
      /* accept the hit only if it is within the right edge of the window */
      if position - i <= allowedDist then do;
        y = substr(string2, position, 1);
        /* build list of matched characters */
        matchChars = cats(matchChars, y);
      end;
    end;
  end;
  matchChars = strip(matchChars);
endsub;
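function jarotrans(string1 $, string2 $);
  /* Counts the positions where the two matched-character strings differ
     and returns half that count (the number of transpositions). */
  ntrans = 0;
  ubnd = min(length(strip(string1)), length(strip(string2)));
  do i = 1 to ubnd;
    if substr(string1, i, 1) ne substr(string2, i, 1) then do;
      ntrans = ntrans + 1;
    end;
  end;
  return(ntrans/2);
endsub;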
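function getPrefixlen(string1 $, string2 $, maxprelen);
  /* get the length of the matching characters at the beginning,
     capped at maxprelen */
  n = min(maxprelen, length(string1), length(string2));
  do i = 1 to n;
    /* a first mismatch at position i means i-1 characters matched (0 if i = 1) */
    if substr(string1, i, 1) ne substr(string2, i, 1)
      then return(i - 1);
  end;
  /* no mismatch within the first n characters */
  return(n);
endsub;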
function jarodist(string1 $, string2 $);
  /* buffers for the matched characters; 200 is an assumed upper bound */
  length m1 m2 $ 200;
  /* get number of matched characters; lengthn returns 0 for a blank
     string, whereas length would return 1 */
  call jaromatch(string1, string2, m1);
  m1_len = lengthn(m1);
  if m1_len = 0 then return(0);
  call jaromatch(string2, string1, m2);
  m2_len = lengthn(m2);
  if m2_len = 0 then return(0);
  /* get number of transposed characters */
  ntrans = jarotrans(m1, m2);
  j_dist = (m1_len/length(string1)
          + m2_len/length(string2)
          + (m1_len - ntrans)/m1_len) / 3;
  return(j_dist);
endsub;
function jarowink(string1 $, string2 $, prefixscale);
  /* Jaro score boosted by the length of the common prefix (up to 4 characters) */
  jdist = jarodist(string1, string2);
  prelen = getPrefixlen(string1, string2, 4);
  if prelen = 0 then return(jdist);
  else return(jdist + prelen * prefixscale * (1 - jdist));
endsub;
run; quit;
/* tell SAS where to find the functions we just wrote */
options cmplib=work.jaro;
/* Now let's try it out! */
data _null_;
  string1 = 'DIXON';
  string2 = 'DICKSONX';
  x = jarodist(string1, string2);
  y = jarowink(string1, string2, 0.1);
  put x= y=;
run;
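Once the functions are compiled, they can be called on every row of a data set as well as on literals. The following is only a sketch of that usage: the data set names pairs and scored and the variables name1, name2 and jw are made up for illustration, and it assumes the proc fcmp step above has already been run in the same session.

```sas
/* assumes the proc fcmp step above has been run in this session */
options cmplib=work.jaro;

/* hypothetical input: one pair of names per row */
data pairs;
  length name1 name2 $ 20;
  input name1 $ name2 $;
  datalines;
MARTHA MARHTA
DWAYNE DUANE
DIXON DICKSONX
;
run;

/* score each pair with the user-written function */
data scored;
  set pairs;
  jw = jarowink(name1, name2, 0.1);
run;

proc print data=scored;
run;
```

Because the scores come from a home-grown approximation, it is worth spot-checking a few pairs against another implementation before relying on them.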