Question
******Edited with Solution Below*******
I have carefully read the guidelines and hope the question is acceptable.
I have two pandas DataFrames. I need to apply a fuzzy matching function to the target and reference columns and merge the data based on the similarity score, while preserving the original data.
I have checked similar questions, e.g. "is it possible to do fuzzy match merge with python pandas?", but I am not able to use that solution. So far I have:
import pandas as pd
import distance

df1 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'],'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})

fulldf = []
for name1 in df1['RefName']:
    for name2 in df2['TarName']:
        # Compute the distance once and keep only sufficiently close pairs
        score = distance.jaccard(name1, name2)
        if score < 0.6:
            fulldf.append({'RefName': name1, 'Score': score, 'TarName': name2})
pd_fulldf = pd.DataFrame(fulldf)
How can I include 'NameId' and 'Type' (and any other columns) in the final output, e.g.:
df1_NameId RefName df1_Type df1_NewColumn Score df2_NameId TarName df2_Type df2_NewColumn
1 robert johnes Person … 0.0000 1 roberto johnes Person …
Is there a way to code this so that it scales easily and can be run on datasets with hundreds of thousands of rows?
I have solved the original problem by unpacking the DataFrame rows in the loop:
import distance
import pandas as pd

# Create test DataFrames
df1 = pd.DataFrame({'NameId': [1,2,3], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
results = []
# itertuples() yields one (index, NameId, name) tuple per row, one row at a time.
# Unpack every element that should appear in the final output,
# then append it to the results list.
for a, b, c in df1.itertuples():
    for d, e, f in df2.itertuples():
        results.append((a, b, c, distance.jaccard(c, f), e, d, f))
result_df = pd.DataFrame(results)
print(result_df)
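The positional unpacking above only works while each frame has exactly two columns; adding 'Type' or any other column would break the `for a, b, c in ...` pattern. A minimal sketch of a variant that carries every column along, whatever their number, is shown below. It assumes the column names are valid Python identifiers (so `itertuples` can expose them as attributes), and the `jaccard` helper is a hypothetical stand-in for `distance.jaccard`, assumed to compute the same character-set Jaccard distance, so that the example runs without the `distance` package.

```python
import pandas as pd


def jaccard(a, b):
    """Character-set Jaccard distance (hypothetical stand-in for distance.jaccard)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1 - len(set_a & set_b) / len(union)


df1 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person'] * 3,
                    'RefName': ['robert johnes', 'lew malinsky', 'gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person'] * 3,
                    'TarName': ['roberto johnes', 'lew malinosky', 'andreatta della blatta']})

rows = []
# index=False drops the index, so each tuple holds only the named columns
for ref in df1.itertuples(index=False):
    for tar in df2.itertuples(index=False):
        rows.append({
            # Prefix every df1 column, whatever columns happen to exist
            **{f'df1_{name}': value for name, value in ref._asdict().items()},
            'Score': jaccard(ref.RefName, tar.TarName),
            # Same for df2
            **{f'df2_{name}': value for name, value in tar._asdict().items()},
        })

result_df = pd.DataFrame(rows)
print(result_df)
```

The nested loop still compares every pair of rows, so this only addresses the extra-columns part of the question, not the scalability part.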
Recommended answer
I believe what you need is the Cartesian product of TarName and RefName. Applying the distance function to each row of that product gives the result you require.
df1["mergekey"] = 0
df2["mergekey"] = 0
df_merged = pd.merge(df1, df2, on = "mergekey")
df_merged["Distance"] = df_merged.apply(lambda x: distance.jaccard(x.RefName, x.TarName), axis = 1)
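On pandas 1.2 or later, the same Cartesian product can be built without the helper `mergekey` column by passing `how="cross"` to `pd.merge`. The sketch below is an alternative to the code above, not the answer's own method. The `jaccard` helper is a hypothetical stand-in for `distance.jaccard` (assumed to be the same character-set Jaccard distance), and the `_df1`/`_df2` suffixes are illustrative choices.

```python
import pandas as pd


def jaccard(a, b):
    """Character-set Jaccard distance (hypothetical stand-in for distance.jaccard)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1 - len(set_a & set_b) / len(union)


df1 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person'] * 3,
                    'RefName': ['robert johnes', 'lew malinsky', 'gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person'] * 3,
                    'TarName': ['roberto johnes', 'lew malinosky', 'andreatta della blatta']})

# how="cross" pairs every row of df1 with every row of df2 (requires pandas >= 1.2)
df_cross = pd.merge(df1, df2, how='cross', suffixes=('_df1', '_df2'))

# A list comprehension over the two name columns avoids a row-wise DataFrame.apply
df_cross['Distance'] = [jaccard(ref, tar)
                        for ref, tar in zip(df_cross['RefName'], df_cross['TarName'])]
print(df_cross)
```

Columns that exist in both frames (`NameId`, `Type`) receive the suffixes, while `RefName` and `TarName` keep their names because they are unique to one frame.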
Result:
NameId_x RefName Type_x mergekey NameId_y TarName Type_y Distance
0 1 robert johnes Person 0 1 roberto johnes Person 0.000000
1 1 robert johnes Person 0 2 lew malinosky Person 0.705882
2 1 robert johnes Person 0 3 andreatta della blatta Person 0.538462
3 2 lew malinsky Person 0 1 roberto johnes Person 0.764706
4 2 lew malinsky Person 0 2 lew malinosky Person 0.083333
5 2 lew malinsky Person 0 3 andreatta della blatta Person 0.666667
6 3 gioberto delle lanterne Person 0 1 roberto johnes Person 0.533333
7 3 gioberto delle lanterne Person 0 2 lew malinosky Person 0.588235
8 3 gioberto delle lanterne Person 0 3 andreatta della blatta Person 0.250000
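The table holds every pair, whereas the question keeps only pairs with a distance below 0.6 and asks about inputs with hundreds of thousands of rows, where the full product may not fit in memory. One possible approach, sketched here under assumptions, is to cross-join one slice of the reference frame at a time and filter each slice before storing it. The function name `fuzzy_pairs`, its `max_distance` and `chunk_size` parameters, and the `df1_`/`df2_` prefixes are hypothetical, `jaccard` is a stand-in for `distance.jaccard`, and `how="cross"` requires pandas 1.2 or later.

```python
import pandas as pd


def jaccard(a, b):
    """Character-set Jaccard distance (hypothetical stand-in for distance.jaccard)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1 - len(set_a & set_b) / len(union)


def fuzzy_pairs(ref, tar, max_distance=0.6, chunk_size=1000):
    """Return all (ref, tar) row pairs whose name distance is below max_distance."""
    tar_prefixed = tar.add_prefix('df2_')
    kept = []
    for start in range(0, len(ref), chunk_size):
        # Only chunk_size * len(tar) candidate pairs exist in memory at once
        chunk = ref.iloc[start:start + chunk_size].add_prefix('df1_')
        pairs = chunk.merge(tar_prefixed, how='cross')
        pairs['Score'] = [jaccard(r, t)
                          for r, t in zip(pairs['df1_RefName'], pairs['df2_TarName'])]
        kept.append(pairs[pairs['Score'] < max_distance])
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)


df1 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person'] * 3,
                    'RefName': ['robert johnes', 'lew malinsky', 'gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person'] * 3,
                    'TarName': ['roberto johnes', 'lew malinosky', 'andreatta della blatta']})

matches = fuzzy_pairs(df1, df2, max_distance=0.6, chunk_size=2)
print(matches)
```

Chunking bounds memory use but not the work: every reference name is still compared with every target name, so the run time grows with the product of the two row counts.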