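I have two DataFrames, each with a column named Song, but the song titles are sometimes spelled differently. How can I use difflib (or something similar) to put one DataFrame's spelling of Song into a new column of the other DataFrame?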

For example:

Dataframe1

Song           Artist
like a virgi   madonna


Dataframe2

Song           Rank
like a virgin  2


Result

Song            Artist    SongAlt
like a virgin   Madonna   like a virgi

Best answer

Step 1: Merge everything that can be merged

In [67]: df1
Out[67]:
           Song    Artist
0        mysong  myartist
1  like a virgi   madonna

In [68]: df2
Out[68]:
            Song  Rank
0         mysong     1
1  like a virgin     2

In [69]: merged = pd.merge(df1, df2, on='Song')

In [70]: merged
Out[70]:
     Song    Artist  Rank
0  mysong  myartist     1
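
pd.merge performs an inner join by default, which is why only the exactly matching title survives above. As a minimal sketch (the name `audit` is hypothetical and not part of the original answer), an outer join with `indicator=True` shows which titles matched and which did not:

```python
import pandas as pd

df1 = pd.DataFrame({"Song": ["mysong", "like a virgi"],
                    "Artist": ["myartist", "madonna"]})
df2 = pd.DataFrame({"Song": ["mysong", "like a virgin"],
                    "Rank": [1, 2]})

# how="outer" keeps rows from both frames; indicator=True adds a "_merge"
# column whose values are "both", "left_only" or "right_only".
audit = pd.merge(df1, df2, on="Song", how="outer", indicator=True)

# Rows present in both frames are the exact matches.
exact = audit[audit["_merge"] == "both"]
print(audit)
```

Rows marked `left_only` or `right_only` are the ones that need fuzzy matching.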


Step 2: Find what is left over

In [71]: unmerged = df2[~df2.isin(merged)].dropna()

In [72]: unmerged
Out[72]:
            Song  Rank
1  like a virgin   2.0
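
One caveat: when `DataFrame.isin` is given another DataFrame, it compares values position by position using the index and column labels, so the expression above works here because the matching row happens to sit at index 0 in both frames. It also routes Rank through NaN, which is why it prints as 2.0. A possibly more robust variant, offered as a sketch rather than as the answer's own method, filters on the Song column alone:

```python
import pandas as pd

df1 = pd.DataFrame({"Song": ["mysong", "like a virgi"],
                    "Artist": ["myartist", "madonna"]})
df2 = pd.DataFrame({"Song": ["mysong", "like a virgin"],
                    "Rank": [1, 2]})

merged = pd.merge(df1, df2, on="Song")

# Keep the df2 rows whose Song did not take part in the exact merge.
# Series.isin tests membership by value, independent of row order or index.
unmerged = df2[~df2["Song"].isin(merged["Song"])]
print(unmerged)
```

With this variant Rank keeps its integer dtype, so later outputs would show 2 rather than 2.0.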


Step 3: Use difflib's get_close_matches to find the closest match

In [73]: songs = list(df1['Song'].unique())

In [74]: def closest(a):
    ...:     try:
    ...:         return difflib.get_close_matches(a, songs)[0]
    ...:     except IndexError:
    ...:         return "Not Found"

In [75]: unmerged['closest_song'] = unmerged.apply(lambda row: closest(row['Song']), axis=1)

In [76]: unmerged
Out[76]:
            Song  Rank  closest_song
1  like a virgin   2.0  like a virgi
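
get_close_matches also accepts `n` (maximum number of results, default 3) and `cutoff` (minimum similarity, default 0.6). The following sketch is a variation on `closest` above; the extra parameters and the `None` return value are choices made for this example, not part of the original answer:

```python
import difflib

songs = ["mysong", "like a virgi"]

def closest(title, candidates, cutoff=0.6):
    """Return the best match for title, or None if nothing reaches cutoff."""
    # n=1 asks only for the single best candidate.
    matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

best = closest("like a virgin", songs)          # close enough to match
no_match = closest("bohemian rhapsody", songs)  # nothing reaches the cutoff
print(best, no_match)
```

Returning `None` instead of the string "Not Found" keeps unmatched rows easy to spot with `isna()`, and raising `cutoff` makes the matching stricter.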


Step 4: If needed, get the similarity score (a ratio between 0 and 1)

In [77]: def similar(a, b):
    ...:     return difflib.SequenceMatcher(None, a, b).ratio()

In [78]: unmerged['Similarity'] = unmerged.apply(lambda row: similar(row['closest_song'], row['Song']), axis=1)

In [79]: unmerged
Out[79]:
            Song  Rank  closest_song  Similarity
1  like a virgin   2.0  like a virgi        0.96
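
For reference, `SequenceMatcher.ratio()` is documented as 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes the 0.96 shown above by hand; the variable names are illustrative only:

```python
import difflib

a, b = "like a virgi", "like a virgin"
matcher = difflib.SequenceMatcher(None, a, b)

# Sum the sizes of all matching blocks to get M (the final block is a
# zero-length sentinel, so it does not affect the sum).
matched = sum(block.size for block in matcher.get_matching_blocks())

# T is the total number of characters in both strings: 12 + 13 = 25.
ratio = 2 * matched / (len(a) + len(b))
print(matched, ratio, matcher.ratio())
```

Here all 12 characters of the shorter title match, giving 2 * 12 / 25 = 0.96.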


Step 5: Merge using the closest values

In [80]: unmerged.rename(columns={'Song': 'Old_Song', 'closest_song': 'Song'}, inplace=True)

In [81]: new = unmerged.merge(df1, on='Song')[['Song', 'Artist', 'Rank']]

In [82]: new
Out[82]:
           Song   Artist  Rank
0  like a virgi  madonna   2.0

In [83]: pd.concat([merged, new])
Out[83]:
           Song    Artist  Rank
0        mysong  myartist   1.0
0  like a virgi   madonna   2.0
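
The question asked for the alternate spelling in its own column (SongAlt). One way to get that layout is to map every title in df2 to its closest title in df1 and then join on that column. This is a compact sketch under the same sample data, not the answer's own method; `closest` and `candidates` are names chosen for this example:

```python
import difflib
import pandas as pd

df1 = pd.DataFrame({"Song": ["mysong", "like a virgi"],
                    "Artist": ["myartist", "madonna"]})
df2 = pd.DataFrame({"Song": ["mysong", "like a virgin"],
                    "Rank": [1, 2]})

def closest(title, candidates, cutoff=0.6):
    """Return the best match for title, or None if nothing reaches cutoff."""
    matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

candidates = df1["Song"].unique().tolist()

# SongAlt holds df1's spelling of each df2 title (exact matches map to themselves).
result = df2.assign(SongAlt=df2["Song"].map(lambda s: closest(s, candidates)))

# Pull in Artist by joining df2's SongAlt against df1's Song.
result = result.merge(
    df1.rename(columns={"Song": "SongAlt"}), on="SongAlt", how="left"
)[["Song", "Artist", "SongAlt", "Rank"]]
print(result)
```

Because every title goes through `get_close_matches`, this skips the separate exact-merge step at the cost of more comparisons, which may matter for large frames.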
