Problem description
I have two Excel sheets. One contains summaries and the other contains categories with potential filter words. I need to assign a category to each row of the first dataframe whenever its summary matches any of that category's filter words in the second dataframe.
I have attempted to expand the list in the second dataframe and map by matching the terms to any words in the first dataframe.
import pandas as pd
data1 = {'Bucket':['basket', 'bushel', 'peck', 'box'], 'Summary':['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red and sweet.', 'There is a peck of pears that taste sweet. They are very green.', 'We have a box of plums. They are sour and have a great color.']}
data2 = {'Category':['Fruit', 'Color'], 'Filters':['apple, pear, plum, grape', 'red, purple, green']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
Bucket Summary
0 basket This is a basket of red apples. They are sour.
1 bushel We found a bushel of fruit. They are red and s...
2 peck There is a peck of pears that taste sweet. The...
3 box We have a box of plums. They are sour and have...
print(df2)
Category Filters
0 Fruit apple, pear, plum, grape
1 Color red, purple, green
This line converts the Category column of the table into a list for later use.
category_list = df2['Category'].values
category_list = list(set(category_list))
Attempt to match the text.
for item in category_list:
    item = df2.loc[df2['Category'] == item]
    filter_list = item['Filters'].values
    filter_list = list(set(filter_list))
    df1 = df1 [df1 ['Summary'].isin(filter_list)]
I want the first dataframe to have categories assigned to it separated by a comma.
Desired result:
Bucket Category Summary
0 basket Fruit, Color This is a basket of red apples. They are sour.
1 bushel Color We found a bushel of fruit. They are red and s...
2 peck Fruit, Color There is a peck of pears that taste sweet. The...
3 box Fruit We have a box of plums. They are sour and have...
I hope this is clear. I have been banging my head against it for a week now.
Thanks in advance.
Recommended answer
Use pandas.Series.str.contains to check Filters with a loop:
df2['Filters']=[key.replace(' ','') for key in df2['Filters']]
df2['Filters']=df2['Filters'].apply(lambda x : x.split(','))
Fruit=pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Fruit']]).any()
Color=pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Color']]).any()
print(Fruit)
print(Color)
0 True
1 False
2 True
3 True
dtype: bool
0 True
1 True
2 True
3 False
dtype: bool
Then use np.where together with Series.str.cat to get the dataframe output:
import numpy as np
df1['Fruit']=np.where(Fruit,'Fruit','')
df1['Color']=np.where(Color,'Color','')
df1['Category']=df1['Fruit'].str.cat(df1['Color'],sep=', ')
df1=df1[['Bucket','Category','Summary']]
print(df1)
Bucket Category Summary
0 basket Fruit, Color This is a basket of red apples. They are sour.
1 bushel , Color We found a bushel of fruit. They are red and s...
2 peck Fruit, Color There is a peck of pears that taste sweet. The...
3 box Fruit, We have a box of plums. They are sour and have...
Alternatively, starting again from the original df1 and df2, the same logic generalized to any number of categories:
df2['Filters']=[key.replace(' ','') for key in df2['Filters']]
df2['Filters']=df2['Filters'].apply(lambda x : x.split(','))
Categories=[pd.Series(np.where(( pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters'][category_filter]]).any() ),category_filter,'')) for category_filter in df2['Category']]
df1['Category']=Categories[0].str.cat(Categories[1:],sep=', ')
df1=df1.reindex(columns=['Bucket','Category','Summary'])
print(df1)
Bucket Category Summary
0 basket Fruit, Color This is a basket of red apples. They are sour.
1 bushel , Color We found a bushel of fruit. They are red and s...
2 peck Fruit, Color There is a peck of pears that taste sweet. The...
3 box Fruit, We have a box of plums. They are sour and have...
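Note that both outputs above still carry stray separators (", Color" and "Fruit, ") because empty strings are concatenated along with the matches. One way to get the exact result asked for in the question is to build a boolean mask per category and join only the names that matched. The following is a minimal sketch, not part of the original answer: the helper name assign_categories is hypothetical, and it assumes case-insensitive substring matching is acceptable (so "red" would also match inside a longer word).

```python
import re

import pandas as pd

df1 = pd.DataFrame({
    'Bucket': ['basket', 'bushel', 'peck', 'box'],
    'Summary': [
        'This is a basket of red apples. They are sour.',
        'We found a bushel of fruit. They are red and sweet.',
        'There is a peck of pears that taste sweet. They are very green.',
        'We have a box of plums. They are sour and have a great color.',
    ],
})
df2 = pd.DataFrame({
    'Category': ['Fruit', 'Color'],
    'Filters': ['apple, pear, plum, grape', 'red, purple, green'],
})


def assign_categories(summaries, categories):
    """Hypothetical helper: return one comma-separated category string per summary."""
    masks = {}
    for category, filters in zip(categories['Category'], categories['Filters']):
        # Split the comma-separated filter words and drop surrounding spaces.
        terms = [term.strip() for term in filters.split(',') if term.strip()]
        # One regex alternation per category; escape terms so they match literally.
        pattern = '|'.join(re.escape(term) for term in terms)
        masks[category] = summaries.str.contains(pattern, case=False, regex=True, na=False)
    matched = pd.DataFrame(masks, index=summaries.index)
    # Join only the category names whose mask is True for the row.
    return matched.apply(
        lambda row: ', '.join(matched.columns[row.to_numpy()]), axis=1
    )


df1['Category'] = assign_categories(df1['Summary'], df2)
df1 = df1[['Bucket', 'Category', 'Summary']]
print(df1)
```

Rows that match no category get an empty string. If whole-word matching is needed, wrapping the pattern in word boundaries is one option, but it would stop "apple" from matching "apples", so the filter words would need to be adjusted accordingly.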