I have a df that looks like this:

Group   Attribute

Cheese  Dairy
Cheese  Food
Cheese  Curd
Cow     Dairy
Cow     Food
Cow     Animal
Cow     Hair
Cow     Stomachs
Yogurt  Dairy
Yogurt  Food
Yogurt  Curd
Yogurt  Fruity


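What I want to do, for each group, is find the group it is most like, based on the intersection of their attributes. The final form I am after is: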

Group   TotalCount   LikeGroup   CommonWords  PCT

Cheese  3            Yogurt      3            100.0
Cow     5            Cheese      2            40.0
Yogurt  4            Cheese      3            75.0


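I realize this may be asking a lot in one question. I can do most of it, but I am lost on how to count the attributes two groups share, even for a single pair of groups. If I could find the number of attributes common to Cheese and Yogurt, that would get me headed in the right direction.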

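Is it possible to do this within the dataframe? I can see building several lists, intersecting every pair of lists, and then using the lengths of the resulting lists to get the percentages.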

For example, for Yogurt:

>>> Yogurt = ['Dairy', 'Food', 'Curd', 'Fruity']
>>> Cheese = ['Dairy', 'Food', 'Curd']

>>> Yogurt_Cheese = len(set(Yogurt) & set(Cheese)) / len(Yogurt)  # true division (Python 3)
>>> Yogurt_Cheese
0.75

>>> Yogurt = ['Dairy', 'Food', 'Curd', 'Fruity']
>>> Cow = ['Dairy', 'Food', 'Animal', 'Hair', 'Stomachs']

>>> Yogurt_Cow = len(set(Yogurt) & set(Cow)) / len(Yogurt)
>>> Yogurt_Cow
0.5

>>> max(Yogurt_Cheese, Yogurt_Cow)
0.75
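
The list-based idea above can be extended to every pair of groups. The following is only a rough sketch under my own assumptions: the `groups` dictionary is typed in by hand from the sample data, and `overlap_fraction` and `best` are illustrative names, not part of any library.

```python
# Attribute lists copied by hand from the sample data above.
groups = {
    'Cheese': ['Dairy', 'Food', 'Curd'],
    'Cow': ['Dairy', 'Food', 'Animal', 'Hair', 'Stomachs'],
    'Yogurt': ['Dairy', 'Food', 'Curd', 'Fruity'],
}


def overlap_fraction(a, b):
    """Share of a's attributes that also appear in b (hypothetical helper)."""
    return len(set(a) & set(b)) / len(a)


# For each group, remember the other group with the highest overlap.
best = {}
for name, attrs in groups.items():
    best_other, best_score = None, -1.0
    for other, other_attrs in groups.items():
        if other == name:
            continue  # never compare a group with itself
        score = overlap_fraction(attrs, other_attrs)
        if score > best_score:  # strict '>' keeps the first group on ties
            best_other, best_score = other, score
    best[name] = (best_other, best_score)

print(best)
```

Note that Cow overlaps equally with Cheese and Yogurt (two shared attributes each), so which one is reported depends on the tie-breaking rule; here the first one encountered is kept.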

Best answer

I created my own smaller version of your sample data.

import pandas as pd
from itertools import permutations

df = pd.DataFrame(
    data=[['cheese', 'dairy'], ['cheese', 'food'], ['cheese', 'curd'],
          ['cow', 'dairy'], ['cow', 'food'],
          ['yogurt', 'dairy'], ['yogurt', 'food'], ['yogurt', 'curd'],
          ['yogurt', 'fruity']],
    columns=['Group', 'Attribute'])

# TotalCount per group as a plain dict, e.g. {'cheese': 3, 'cow': 2, 'yogurt': 4}; used later
count_dct = df.groupby('Group')['Attribute'].count().to_dict()

unique_grp = df['Group'].unique()      # get the unique groups
unique_atr = df['Attribute'].unique()  # get the unique attributes

# all ordered pairs of groups, so both (a, b) and (b, a) get a row
combos = list(permutations(unique_grp, 2))
comp_df = pd.DataFrame(data=combos, columns=['Group', 'LikeGroup'])  # frame that holds the comparison data
comp_df['CommonWords'] = 0

for atr in unique_atr:
    # keep only the rows for the attribute examined in this iteration
    temp_df = df[df['Attribute'] == atr]

    # ordered pairs of groups that have this attribute in common
    myl = list(permutations(temp_df['Group'], 2))
    for comb in myl:
        # increment CommonWords on the row whose Group is the first entry
        # of the tuple and whose LikeGroup is the second entry
        comp_df.loc[(comp_df['Group'] == comb[0]) & (comp_df['LikeGroup'] == comb[1]), 'CommonWords'] += 1

# put the previously computed TotalCount into the comparison dataframe
comp_df['TotalCount'] = comp_df['Group'].map(count_dct)

comp_df['PCT'] = (comp_df['CommonWords'] * 100.0 / comp_df['TotalCount']).round().astype(int)


For my sample data, I got the output

    Group LikeGroup  CommonWords  TotalCount  PCT
0  cheese       cow            2           3   67
1  cheese    yogurt            3           3  100
2     cow    cheese            2           2  100
3     cow    yogurt            2           2  100
4  yogurt    cheese            3           4   75
5  yogurt       cow            2           4   50


This appears to be correct.
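
The table above keeps every ordered pair, while the question asks for one row per group. As a hedged sketch of one way to get there, the block below swaps the loop for a different technique: a group-by-attribute indicator table from `pd.crosstab`, multiplied by its own transpose to count shared attributes. It uses the question's original data, assumes each (Group, Attribute) pair appears at most once, and the names `onehot`, `common`, `others` and `result` are my own.

```python
import numpy as np
import pandas as pd

# The question's original sample data.
df = pd.DataFrame(
    [('Cheese', 'Dairy'), ('Cheese', 'Food'), ('Cheese', 'Curd'),
     ('Cow', 'Dairy'), ('Cow', 'Food'), ('Cow', 'Animal'),
     ('Cow', 'Hair'), ('Cow', 'Stomachs'),
     ('Yogurt', 'Dairy'), ('Yogurt', 'Food'), ('Yogurt', 'Curd'),
     ('Yogurt', 'Fruity')],
    columns=['Group', 'Attribute'])

# Group x Attribute indicator table (1 where the group has the attribute).
# Assumption: each (Group, Attribute) pair appears at most once.
onehot = pd.crosstab(df['Group'], df['Attribute'])

# Entry [g, h] of the product counts the attributes g and h share;
# the diagonal therefore holds each group's own attribute count.
common = onehot.dot(onehot.T)
total = pd.Series(np.diag(common.to_numpy()), index=common.index)

# Blank out the diagonal so a group is never matched with itself.
others = common.mask(np.eye(len(common), dtype=bool))

result = pd.DataFrame({
    'TotalCount': total,
    'LikeGroup': others.idxmax(axis=1),  # first maximum wins on ties
    'CommonWords': others.max(axis=1).astype(int),
})
result['PCT'] = (result['CommonWords'] * 100.0 / result['TotalCount']).round(1)
result = result.rename_axis('Group').reset_index()

print(result)
```

Ties are resolved in favor of the first column, so Cow (which shares two attributes with both Cheese and Yogurt) is paired with Cheese. For large numbers of groups the square `common` table may become expensive, in which case the pairwise loop or a sparse representation might be preferable.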

Regarding "python - finding similar groups based on the intersection of values in a different column", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/42425273/
