I'm working with a Series that looks like this:
import pandas as pd

l0 = ['smartphone', 'battery', 'case', 'grey', '10071852']
l1 = ['phone', 'new', 'charging', 'case', 'white']
l2 = ['tablet', 'phone', 'pin', 'adapter', 'ex766']
l3 = ['phone', 'silicon', 'case', 'brown']
mySeries = pd.Series([l0,l1,l2,l3])
print(mySeries)
0    [smartphone, battery, case, grey, 10071852]
1            [phone, new, charging, case, white]
2           [tablet, phone, pin, adapter, ex766]
3                  [phone, silicon, case, brown]
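I'm trying to search each row (each list) of this Series for keywords and keyword groups it may contain. More specifically, suppose I want to find out whether a row of the Series contains any of the following keywords: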
simple_keywords = {'case', 'adapter'}
and also whether it contains any of the following keyword pairs:
double_keywords = {'battery case', 'charging case'}
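Looking for simple_keywords seems easy enough. However, I also want to look for the pairs, and if a row contains a pair such as 'battery case', I want to get the pair back rather than just 'case'.
I also have a DataFrame that looks like this: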
d = {'Date': ['03/08/2014', '04/08/2014', '05/08/2014', '06/08/2014'], 'Product': ['none', 'none','none','none'],'Frequency': [5, 10, 1, 2]}
myDF = pd.DataFrame(data=d)
print(myDF)
         Date  Frequency  Product
0  03/08/2014          5     none
1  04/08/2014         10     none
2  05/08/2014          1     none
3  06/08/2014          2     none
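My end goal is to write into this DataFrame (in the 'Product' column) the keyword (or keyword pair) that I identified in the Series. Each row of the Series corresponds to the same row of the DataFrame, so order matters. For example, I'd like to see that on 03/08/2014 the product 'battery case' has a frequency of 5.
I tried to come up with a solution by splitting the keyword pairs apart, but it seems very slow and inefficient, because the Series I'm working with has more than 350,000 rows (I left it running overnight and it still hadn't finished):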
first_keywords = {'case', 'adapter'}
second_keywords = {'battery', 'charging'}
mySeries_range = len(mySeries)
for i in range(mySeries_range):
    for x, y in [(x, y) for x in first_keywords for y in second_keywords]:
        if x in mySeries[i] and y in mySeries[i]:
            myDF.Product[i] = y + ' ' + x
        elif x in mySeries[i] and y not in mySeries[i]:
            myDF.Product[i] = x
The final result I'd like to obtain is:
         Date  Frequency        Product
0  03/08/2014          5   battery case
1  04/08/2014         10  charging case
2  05/08/2014          1        adapter
3  06/08/2014          2           case
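It would be great if someone could help me. Sorry if my code isn't very pretty, I'm trying... to get better!
Best answer
You can generate combinations of any length from the words in the lists of mySeries like this: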
import itertools

# One column per combination length (1 and 2), each cell holding the joined word combinations
df_comb = pd.concat(
    [mySeries.apply(lambda x: [" ".join(l) for l in itertools.combinations(x, max_len)])
             .rename(max_len)
     for max_len in [1, 2]],
    axis=1).astype(str)
The result is:
>>> df_comb
                                             1  \
0  [smartphone, battery, case, grey, 10071852]
1          [phone, new, charging, case, white]
2         [tablet, phone, pin, adapter, ex766]
3                [phone, silicon, case, brown]

                                                   2
0  [smartphone battery, smartphone case, smartpho...
1  [phone new, phone charging, phone case, phone ...
2  [tablet phone, tablet pin, tablet adapter, tab...
3  [phone silicon, phone case, phone brown, silic...
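To make the combination step concrete, here is a minimal sketch on a single row (the variable names `row` and `pairs` are only for illustration). It shows that `itertools.combinations` keeps the words in row order and pairs up words that are not adjacent:

```python
import itertools

row = ['smartphone', 'battery', 'case', 'grey', '10071852']

# Every 2-word combination, joined with a space.
# A 5-word row yields C(5, 2) = 10 combinations.
pairs = [" ".join(c) for c in itertools.combinations(row, 2)]

print(len(pairs))                  # 10
print('battery case' in pairs)     # True
print('smartphone case' in pairs)  # True: the words need not be adjacent
print('case battery' in pairs)     # False: order follows the row
```

One consequence is that a keyword pair is only found when its words appear in the row in the same order as in the keyword.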
Now turn the keyword sets into lists so they are easier to iterate over:
simple_keywords = ['case', 'adapter']
double_keywords = ['battery case', 'charging case']
Then you can count the occurrences like this:
>>> pd.concat([df_comb.apply(lambda x: pd.Series(x).str.count(w), axis=0)[len(w.split(' '))].rename(w)
...            for w in simple_keywords], axis=1)
   case  adapter
0     1        0
1     1        0
2     0        1
3     1        0
>>> pd.concat([df_comb.apply(lambda x: pd.Series(x).str.count(w),axis=0)[len(w.split(' '))].rename(w) for w in double_keywords],axis=1)
   battery case  charging case
0             1              0
1             0              1
2             0              0
3             0              0
Or we can iterate like this:
df_count = pd.DataFrame()
for list_of_keywords in [simple_keywords, double_keywords]:
    df_count_temp = pd.concat(
        [df_comb.apply(lambda x: pd.Series(x).str.count(w), axis=0)[len(w.split(' '))].rename(w)
         for w in list_of_keywords],
        axis=1)
    df_count = pd.concat([df_count, df_count_temp], axis=1)
The counts will be:
>>> df_count
   case  adapter  battery case  charging case
0     1        0             1              0
1     1        0             0              1
2     0        1             0              0
3     1        0             0              0
You can get the final counts with:
>>> df_count.sum(axis=0).to_frame()
               0
case           3
adapter        1
battery case   1
charging case  1
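One thing to keep in mind: the counts above come from `str.count` on the stringified lists, so they are substring matches. The sketch below uses made-up rows (the word 'suitcase' is not in the original data) to show the difference from counting exact tokens, which is a possible alternative rather than part of this answer:

```python
import pandas as pd

# Hypothetical rows, chosen only to show the difference
rows = pd.Series([['phone', 'suitcase'], ['phone', 'case']])

# Substring counting on the stringified list, as done above
substring_hits = rows.astype(str).str.count('case')

# Exact-token counting on the original lists (alternative, assumed here)
exact_hits = rows.apply(lambda words: words.count('case'))

print(substring_hits.tolist())  # [1, 1]
print(exact_hits.tolist())      # [0, 1]
```

If your keywords never occur inside longer words, both approaches should give the same counts.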
You can create a function to apply this to each day's entries.
import itertools

def my_func(mySeries, keywords=(('case', 'adapter'), ('battery case', 'charging case'))):
    # Each keyword group holds phrases of one length; read that length from its first entry
    keyword_lengths = [len(k[0].split(' ')) for k in keywords]
    # One column per phrase length, holding every word combination of that length
    df_comb = pd.concat(
        [mySeries.apply(lambda x: [" ".join(l) for l in itertools.combinations(x, max_len)])
                 .rename(max_len)
         for max_len in keyword_lengths],
        axis=1).astype(str)
    frames = []
    for list_of_keywords in keywords:
        # Count each keyword in the column whose combination length matches it
        frames.append(pd.concat(
            [df_comb.apply(lambda x: pd.Series(x).str.count(w), axis=0)[len(w.split(' '))].rename(w)
             for w in list_of_keywords],
            axis=1))
    # Join all keyword groups side by side in a single concat
    df_count = pd.concat(frames, axis=1)
    return df_count
Imagine this is your pd.Series:
>>> newSeries
2014-03-08    [smartphone, battery, case, grey, 10071852]
2014-03-08            [phone, new, charging, case, white]
2014-03-08           [tablet, phone, pin, adapter, ex766]
2014-03-08                  [phone, silicon, case, brown]
2014-04-08            [phone, new, charging, case, white]
2014-04-08                           [tablet, phone, pin]
2014-04-08                               [phone, adapter]
dtype: object
>>> my_func(newSeries)
            case  adapter  battery case  charging case
2014-03-08     1        0             1              0
2014-03-08     1        0             0              1
2014-03-08     0        1             0              0
2014-03-08     1        0             0              0
2014-04-08     1        0             0              1
2014-04-08     0        0             0              0
2014-04-08     0        1             0              0
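You can then work with the returned DataFrame, grouping by date and summing the counts. This gives you the number of appearances per date: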
>>> df_appearances = (my_func(newSeries).reset_index().groupby('index')
...                   .sum().T.unstack().reset_index())
>>> df_appearances.columns = ['Date', 'Product', 'Frequency']
>>> df_appearances
         Date        Product  Frequency
0  2014-03-08           case          3
1  2014-03-08        adapter          1
2  2014-03-08   battery case          1
3  2014-03-08  charging case          1
4  2014-04-08           case          1
5  2014-04-08        adapter          1
6  2014-04-08   battery case          0
7  2014-04-08  charging case          1
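The question also asks for one label per row, with a pair taking priority over a single keyword. Here is a minimal sketch of one way to fill the 'Product' column directly. It is not the approach used above: `label_row` is a hypothetical helper, and it assumes that a pair matches whenever both of its words appear anywhere in the row (as in the loop from the question), regardless of order.

```python
import pandas as pd

mySeries = pd.Series([
    ['smartphone', 'battery', 'case', 'grey', '10071852'],
    ['phone', 'new', 'charging', 'case', 'white'],
    ['tablet', 'phone', 'pin', 'adapter', 'ex766'],
    ['phone', 'silicon', 'case', 'brown'],
])
myDF = pd.DataFrame({
    'Date': ['03/08/2014', '04/08/2014', '05/08/2014', '06/08/2014'],
    'Product': ['none'] * 4,
    'Frequency': [5, 10, 1, 2],
})

simple_keywords = ['case', 'adapter']
double_keywords = ['battery case', 'charging case']

def label_row(words):
    """Return the first matching pair, else the first matching single keyword."""
    present = set(words)  # set lookup keeps each membership test cheap
    for pair in double_keywords:
        if all(part in present for part in pair.split()):
            return pair
    for word in simple_keywords:
        if word in present:
            return word
    return 'none'

# Assumes mySeries and myDF are in the same row order, as stated in the question
myDF['Product'] = [label_row(words) for words in mySeries]
print(myDF)
```

Each row is scanned once and nothing is assigned element by element, so this should scale to a few hundred thousand rows far better than the nested loop, though I have not timed it on data of that size.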