Problem description
I have 2 datasets with more than 100K rows each. I would like to merge them by fuzzy string matching on one column ('movie title') and by release date. A sample from each dataset is shown below.
Dataset 1
itemid userid rating time title release_date
99991 1673 835 3 1998-03-27 mirage 1995
99992 1674 840 4 1998-03-29 mamma roma 1962
99993 1675 851 3 1998-01-08 sunchaser, the 1996
99994 1676 851 2 1997-10-01 war at home, the 1996
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1680 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
Dataset 2
itemid userid rating time title release_date
1 2844 4477 3 2013-03-09 fantã´mas - 〠l'ombre de la guillotine 1913
2 4936 8871 4 2013-05-05 the bank 1915
3 4936 11628 3 2013-07-06 the bank 1915
4 4972 16885 4 2013-08-19 the birth of a nation 1915
5 5078 11628 2 2013-08-23 the cheat 1915
6 6684 4222 3 2013-08-24 the fireman 1916
7 6689 4222 3 2013-08-24 the floorwalker 1916
8 7264 2092 4 2013-03-17 the rink 1916
9 7264 5943 3 2013-05-12 the rink 1916
10 7880 11628 4 2013-07-19 easy street 1917
I have looked at 'agrep', but it only matches one string at a time. The 'stringdist' function is good, but you need to run it in a loop, find the minimum distance and then go on to further processing, which is very time consuming given the size of the datasets. The strings can contain typos and special characters, which is why fuzzy matching is required. I have looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. From what I have read, the latter works well when the strings contain typos.
In this scenario, fuzzy matching alone may not give good results. For example, the movie title 'toy story' in one dataset could be matched to 'toy story 2' in the other, which is wrong. So I also need to consider the release date to make sure each match refers to the same movie.
Is there a way to achieve this without using a loop? In the worst case, if I have to use a loop, how can I make it run as efficiently and as fast as possible?
I have tried the following code, but it takes an awfully long time to run.
for (i in 1:nrow(test))
  for (j in 1:nrow(test1))
  {
    test$title.match <- ifelse(jarowinkler(test$x[i], test1$x[j]) > 0.85,
                               test$title, NA)
  }
test - contains 1682 unique movie names converted to lower case
test1 - contains 11451 unique movie names converted to lower case
Is there a way to avoid the for loops and make it run faster?
Recommended answer
How about this approach as a starting point? You can adjust the match threshold from 0.85 once you have seen the results. You could then use dplyr to group by the matched title and summarise by subtracting the release dates. A difference of zero would mean the same release date.
library(RecordLinkage)  # assumption: jarowinkler() comes from the RecordLinkage package
dataset_1$title.match <- ifelse(jarowinkler(dataset_1$title, dataset_2$title) > 0.85, dataset_1$title, NA)
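The one-liner above appears to compare titles position by position rather than every title against every other. A minimal sketch of an all-pairs, loop-free variant follows. It swaps in base R's adist() (generalized Levenshtein distance) in place of Jaro-Winkler so that it runs without extra packages. The toy data, the 0.85 threshold and the normalization by the longer title are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical toy data; column names follow the samples in the question.
dataset_1 <- data.frame(
  title = c("toy story", "sliding doors", "mirage"),
  release_date = c(1995, 1998, 1995),
  stringsAsFactors = FALSE
)
dataset_2 <- data.frame(
  title = c("toy story 2", "toy storry", "sliding door", "the bank"),
  release_date = c(1999, 1995, 1998, 1915),
  stringsAsFactors = FALSE
)

# All pairwise edit distances in one vectorized call
# (rows: dataset_1 titles, columns: dataset_2 titles).
d <- adist(dataset_1$title, dataset_2$title)

# Normalize by the longer title so the score lies in [0, 1], with 1 = identical.
len <- outer(nchar(dataset_1$title), nchar(dataset_2$title), pmax)
sim <- 1 - d / len

# Rule out candidate pairs whose release years differ.
same_year <- outer(dataset_1$release_date, dataset_2$release_date, "==")
sim[!same_year] <- 0

# Best candidate per row of dataset_1; keep it only above the threshold.
best <- max.col(sim, ties.method = "first")
best_sim <- sim[cbind(seq_len(nrow(sim)), best)]
threshold <- 0.85  # assumed cut-off, tune after inspecting the results
dataset_1$title.match <- ifelse(best_sim > threshold, dataset_2$title[best], NA)
```

Here 'toy story' is paired with the misspelled 'toy storry' from the same year, while 'toy story 2' is excluded by the release-date check. A full distance matrix for 1682 x 11451 titles should fit in memory, but for 100K x 100K rows it would not, so one option is to split both datasets by release year and run the same steps within each year.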