python - Finding subsets of keywords in a Pandas Series (Python)

I am working with a Series that looks like this:

l0 = ['smartphone', 'battery', 'case', 'grey', '10071852']
l1 = ['phone', 'new', 'charging', 'case', 'white']
l2 = ['tablet', 'phone', 'pin', 'adapter', 'ex766']
l3 = ['phone', 'silicon', 'case', 'brown']

mySeries = pd.Series([l0,l1,l2,l3])

print(mySeries)

0    [smartphone, battery, case, grey, 10071852]
1            [phone, new, charging, case, white]
2           [tablet, phone, pin, adapter, ex766]
3                  [phone, silicon, case, brown]

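I am trying to search each row (each list) of the Series for keywords and groups of keywords it may contain. More specifically, suppose I want to find out whether a row of the Series contains any of the following keywords: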

simple_keywords = {'case', 'adapter'}

and also whether it contains any of the following keyword pairs:

double_keywords = {'battery case', 'charging case'}

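Finding the simple_keywords seems easy. However, I also want to find the pairs, and if a pair such as 'battery case' is present, I want the pair to be returned rather than just 'case'.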

In addition, I have a DataFrame that looks like this:

d = {'Date': ['03/08/2014', '04/08/2014', '05/08/2014', '06/08/2014'], 'Product': ['none', 'none','none','none'],'Frequency': [5, 10, 1, 2]}
myDF = pd.DataFrame(data=d)

print(myDF)

         Date  Frequency Product
0  03/08/2014          5    none
1  04/08/2014         10    none
2  05/08/2014          1    none
3  06/08/2014          2    none

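My final goal is to write into this DataFrame (in the 'Product' column) the keyword (or pair of keywords) that I identified in the Series. Each row of the Series corresponds to the same row of the DataFrame, so the order matters. For example, I would like to see that on 03/08/2014 the product 'battery case' has a frequency of 5.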

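I tried to come up with a solution by splitting the keyword pairs apart, but it is very slow and inefficient, because the Series I am working with has more than 350'000 rows (I left it running overnight and it still had not finished):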

first_keywords = {'case', 'adapter'}
second_keywords = {'battery', 'charging'}

mySeries_range = len(mySeries)

for i in range(mySeries_range):
    for x, y in [(x, y) for x in first_keywords for y in second_keywords]:
        if x in mySeries[i] and y in mySeries[i]:
            myDF.loc[i, 'Product'] = y + ' ' + x
        elif x in mySeries[i] and y not in mySeries[i]:
            myDF.loc[i, 'Product'] = x

The final result I would like to get is:

         Date  Frequency        Product
0  03/08/2014          5   battery case
1  04/08/2014         10  charging case
2  05/08/2014          1        adapter
3  06/08/2014          2           case

It would be great if someone could help me. Sorry if my code is not very pretty, I am trying... to get better!

Best answer

You can generate combinations of any number of words from the lists in mySeries as follows:

import itertools
df_comb = pd.concat([mySeries.apply(lambda x: [" ".join(l)
                     for l in list(itertools.combinations(x,max_len))
                     ]).rename(max_len)
                     for max_len in [1,2]],axis=1).astype(str)

The result looks like this:

>>> df_comb
                                             1  \
0  [smartphone, battery, case, grey, 10071852]
1          [phone, new, charging, case, white]
2         [tablet, phone, pin, adapter, ex766]
3                [phone, silicon, case, brown]

                                                   2
0  [smartphone battery, smartphone case, smartpho...
1  [phone new, phone charging, phone case, phone ...
2  [tablet phone, tablet pin, tablet adapter, tab...
3  [phone silicon, phone case, phone brown, silic...
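To make the second column easier to read, here is a minimal sketch of what `itertools.combinations` produces for a single row. The variable names `row` and `pairs` are only illustrative. Note that combinations keep the order of the words in the list, so a pair such as 'battery case' is only generated when 'battery' appears before 'case' in the row.

```python
import itertools

row = ['phone', 'silicon', 'case', 'brown']

# All 2-word combinations of the row, joined with a space.
# The order of the words inside each pair follows their order in the row.
pairs = [" ".join(c) for c in itertools.combinations(row, 2)]
print(pairs)
```

A row of n words yields n * (n - 1) / 2 pairs, so long rows make this column grow quickly.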

Now turn the keyword sets into lists, so that they are easier to iterate over:

simple_keywords = ['case', 'adapter']
double_keywords = ['battery case', 'charging case']

Then you can count the occurrences like this:

>>> pd.concat([df_comb.apply(lambda x: pd.Series(x).str.count(w),axis=0)[len(w.split(' '))].rename(w)
for w in simple_keywords],axis=1)
   case  adapter
0     1        0
1     1        0
2     0        1
3     1        0

>>> pd.concat([df_comb.apply(lambda x: pd.Series(x).str.count(w),axis=0)[len(w.split(' '))].rename(w) for w in double_keywords],axis=1)

   battery case  charging case
0             1              0
1             0              1
2             0              0
3             0              0
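One thing to keep in mind: `str.count` runs on the stringified lists, so it matches substrings, not whole words. The sketch below uses made-up rows (they are not from the question) to show the difference, and an exact-match alternative based on list membership; treat it as an illustration rather than a drop-in replacement.

```python
import pandas as pd

# Made-up rows for illustration: 'suitcase' contains 'case' as a substring.
rows = pd.Series([['suitcase', 'grey'], ['phone', 'case']])

# Counting on the stringified list also hits 'suitcase'.
substring_hits = rows.map(str).str.count('case')

# Exact matching: test whether the keyword is an element of each list.
exact_hits = rows.apply(lambda words: int('case' in words))

print(substring_hits.tolist())
print(exact_hits.tolist())
```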

Or we can loop over both keyword lists like this:

df_count = pd.DataFrame()
for list_of_keywords in [simple_keywords, double_keywords]:
    df_count_temp = pd.concat([df_comb.apply(lambda x: pd.Series(x).str.count(w),
                               axis=0)[len(w.split(' '))].rename(w)
                               for w in list_of_keywords],axis=1)
    df_count = pd.concat([df_count, df_count_temp],axis=1)

The counts will be:

>>> df_count

   case  adapter  battery case  charging case
0     1        0             1              0
1     1        0             0              1
2     0        1             0              0
3     1        0             0              0

You can get the total counts like this:

>>> df_count.sum(axis=0).to_frame()

               0
case           3
adapter        1
battery case   1
charging case  1
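The counts above do not yet fill the 'Product' column that the question asks for. A minimal sketch of one way to do that is shown below, using set membership instead of the combinations table. `pick_product` is a hypothetical helper, not part of pandas, and it assumes that a pair counts as present whenever both of its words occur in the row, regardless of order or adjacency, and that pairs take priority over single keywords.

```python
import pandas as pd

mySeries = pd.Series([
    ['smartphone', 'battery', 'case', 'grey', '10071852'],
    ['phone', 'new', 'charging', 'case', 'white'],
    ['tablet', 'phone', 'pin', 'adapter', 'ex766'],
    ['phone', 'silicon', 'case', 'brown'],
])
myDF = pd.DataFrame({
    'Date': ['03/08/2014', '04/08/2014', '05/08/2014', '06/08/2014'],
    'Product': ['none'] * 4,
    'Frequency': [5, 10, 1, 2],
})

simple_keywords = ['case', 'adapter']
double_keywords = ['battery case', 'charging case']


def pick_product(words):
    """Hypothetical helper: first matching pair, else first single keyword, else 'none'."""
    present = set(words)
    # Assumption: a pair matches when both of its words are in the row.
    for pair in double_keywords:
        if set(pair.split(' ')) <= present:
            return pair
    for word in simple_keywords:
        if word in present:
            return word
    return 'none'


# Assign by position, since row i of the Series belongs to row i of the DataFrame.
myDF['Product'] = mySeries.apply(pick_product).to_numpy()
print(myDF)
```

Each row is scanned once and every lookup is a set operation, so this should scale to a few hundred thousand rows far better than the nested loop in the question.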

You can wrap this in a function and apply it to the entries of each day.

def my_func(mySeries, keywords =  [['case', 'adapter'] ,['battery case', 'charging case']]):
    import itertools
    keyword_lengths = [len(k[0].split(' ')) for k in keywords]
    df_comb = pd.concat([mySeries.apply(lambda x: [" ".join(l)
                         for l in list(itertools.combinations(x,max_len))
                         ]).rename(max_len)
                         for max_len in keyword_lengths],axis=1).astype(str)

    df_count = pd.DataFrame()
    for list_of_keywords in keywords:
        df_count_temp = pd.concat([df_comb.apply(lambda x:pd.Series(x).str.count(w),
                                   axis=0)[len(w.split(' '))].rename(w)
                                   for w in list_of_keywords],axis=1)
        df_count = pd.concat([df_count, df_count_temp],axis=1)

    return df_count

Imagine this is your pd.Series:

>>> newSeries
2014-03-08    [smartphone, battery, case, grey, 10071852]
2014-03-08            [phone, new, charging, case, white]
2014-03-08           [tablet, phone, pin, adapter, ex766]
2014-03-08                  [phone, silicon, case, brown]
2014-04-08            [phone, new, charging, case, white]
2014-04-08                           [tablet, phone, pin]
2014-04-08                               [phone, adapter]
dtype: object
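The answer does not show how newSeries is built. One way to construct it, sketched here under the assumption that the dates are used as an unnamed DatetimeIndex (which is why the later `reset_index()` yields a column called 'index'):

```python
import pandas as pd

# Assumption: four entries on the first date and three on the second,
# matching the printed Series above.
dates = pd.to_datetime(['2014-03-08'] * 4 + ['2014-04-08'] * 3)

newSeries = pd.Series([
    ['smartphone', 'battery', 'case', 'grey', '10071852'],
    ['phone', 'new', 'charging', 'case', 'white'],
    ['tablet', 'phone', 'pin', 'adapter', 'ex766'],
    ['phone', 'silicon', 'case', 'brown'],
    ['phone', 'new', 'charging', 'case', 'white'],
    ['tablet', 'phone', 'pin'],
    ['phone', 'adapter'],
], index=dates)

print(newSeries)
```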



>>> my_func(newSeries)

            case  adapter  battery case  charging case
2014-03-08     1        0             1              0
2014-03-08     1        0             0              1
2014-03-08     0        1             0              0
2014-03-08     1        0             0              0
2014-04-08     1        0             0              1
2014-04-08     0        0             0              0
2014-04-08     0        1             0              0

You can then group the returned DataFrame by date and sum the counts. That gives you the number of appearances per date:

>>> df_appearances= my_func(newSeries).reset_index().groupby('index'
                     ).sum().T.unstack().reset_index()

>>> df_appearances.columns = ['Date', 'Product', 'Frequency']

>>> df_appearances

        Date        Product  Frequency
0 2014-03-08           case          3
1 2014-03-08        adapter          1
2 2014-03-08   battery case          1
3 2014-03-08  charging case          1
4 2014-04-08           case          1
5 2014-04-08        adapter          1
6 2014-04-08   battery case          0
7 2014-04-08  charging case          1