我有一个数据框,其中一列仅包含字符串元组,而索引是每月的一天和一周的一天。

所以:

    all_app_id  day_of_week dayofmonth
0   (g8m4lecv, uyhsx6lo, kz8udlea, uyhsx6lo, uyhsx...   Fri 29
1   (awkcmpns, yg02r5dz, yg02r5dz, yg02r5dz, t6rf8...   Sat 30
2   (kz8udlea, 6f91a1e2, mv4uo5xy, mv4uo5xy, lfqlz...   Sun 31
3   (ze4rr0vi, t6rf8pr, ze4rr0vi, 3c83655d34c365c9...   Mon 1
4   (dflewd0u, twidi4l3, twidi4l3, ecur4180, e0ff8...   Tues    2
5   (ohafu6x5, kz8udlea, ohafu6x5, jfigr5ko, ohafu...   Weds    3


以及唯一的app_ids列表:

array(['g8m4lecv', 'uyhsx6lo', 'kz8udlea', ..., 'nrow38d5', 'dekfx963',
       'ybkne8eo'], dtype=object)


实际上,我要计算的是,对于唯一app_id列表中的每个app_id,它们出现的次数均按day_of_weekdayofmonth分组,因此对于每个app_id,我可以计算出它们最常见的day_of_weekdayofmonth,总的来说,我只需对结果数据帧进行操作,就可以计算出最常见的日期和月份。

示例输出(此后应该很容易)是:

星期几

           Mon   Tues   Weds  Thurs  Fri
g8m4lecv   34      53     84      97     20
uyhsx6lo   3423    5443   235     33     54
kz8udlea   121     1212   39      93     756


..

每月的一天

           1       2      3       4      5 ...
g8m4lecv   3       5     8        7     20
uyhsx6lo   12      12    23       233   54
kz8udlea   31      212   39       93     2


...

我一直在尝试从类似的答案中使用groupbyfor app_id in unique_app_idsstr.contains(app_id).sum(),但是我似乎无法理解如何构造它以及如何组织循环。我也在研究list(zip())合并元组,但是再次不确定如何在没有分组依据的情况下按星期合并。

最佳答案

df = pd.DataFrame(
    {'all_app_id': ['g8m4lecv, uyhsx6lo, kz8udlea, uyhsx6lo, uyhsx',
                    'awkcmpns, yg02r5dz, yg02r5dz, yg02r5dz, t6rf8',
                    'kz8udlea, 6f91a1e2, mv4uo5xy, mv4uo5xy, lfqlz',
                    'ze4rr0vi, t6rf8pr, ze4rr0vi, 3c83655d34c365c9',
                    'dflewd0u, twidi4l3, twidi4l3, ecur4180, e0ff8',
                    'ohafu6x5, kz8udlea, ohafu6x5, jfigr5ko, ohafu'],
     'day_of_week': ['Fri', 'Sat', 'Sun', 'Mon', 'Tues', 'Weds'],
     'dayofmonth': [29, 30, 31, 1, 2, 3]})

>>> (pd.melt(
         pd.concat([df.all_app_id.str.split(',', expand=True),
                    df[['day_of_week', 'dayofmonth']]],
                   axis=1),
         id_vars=['day_of_week', 'dayofmonth'],
         value_name='app_id')
     .drop('variable', axis=1)
     .pivot_table(index='app_id', columns='day_of_week', aggfunc='count')
     .fillna(0)
     )
                  dayofmonth
day_of_week              Fri Mon Sat Sun Tues Weds
app_id
 3c83655d34c365c9          0   1   0   0    0    0
 6f91a1e2                  0   0   0   1    0    0
 e0ff8                     0   0   0   0    1    0
 ecur4180                  0   0   0   0    1    0
 jfigr5ko                  0   0   0   0    0    1
 kz8udlea                  1   0   0   0    0    1
 lfqlz                     0   0   0   1    0    0
 mv4uo5xy                  0   0   0   2    0    0
 ohafu                     0   0   0   0    0    1
 ohafu6x5                  0   0   0   0    0    1
 t6rf8                     0   0   1   0    0    0
 t6rf8pr                   0   1   0   0    0    0
 twidi4l3                  0   0   0   0    2    0
 uyhsx                     1   0   0   0    0    0
 uyhsx6lo                  2   0   0   0    0    0
 yg02r5dz                  0   0   3   0    0    0
 ze4rr0vi                  0   1   0   0    0    0
awkcmpns                   0   0   1   0    0    0
dflewd0u                   0   0   0   0    1    0
g8m4lecv                   1   0   0   0    0    0
kz8udlea                   0   0   0   1    0    0
ohafu6x5                   0   0   0   0    0    1
ze4rr0vi                   0   1   0   0    0    0


首先,您需要将all_app_id列拆分为单独的列:

>>> df.all_app_id.str.split(',', expand=True)
          0          1          2                  3       4
0  g8m4lecv   uyhsx6lo   kz8udlea           uyhsx6lo   uyhsx
1  awkcmpns   yg02r5dz   yg02r5dz           yg02r5dz   t6rf8
2  kz8udlea   6f91a1e2   mv4uo5xy           mv4uo5xy   lfqlz
3  ze4rr0vi    t6rf8pr   ze4rr0vi   3c83655d34c365c9    None
4  dflewd0u   twidi4l3   twidi4l3           ecur4180   e0ff8
5  ohafu6x5   kz8udlea   ohafu6x5           jfigr5ko   ohafu


然后,您重新粘贴dayofmonth和day_of_week列:

df2 = pd.concat([df.all_app_id.str.split(',', expand=True),
                 df[['day_of_week', 'dayofmonth']]], axis=1)
>>> df2
          0          1          2                  3       4 day_of_week  dayofmonth
0  g8m4lecv   uyhsx6lo   kz8udlea           uyhsx6lo   uyhsx         Fri          29
1  awkcmpns   yg02r5dz   yg02r5dz           yg02r5dz   t6rf8         Sat          30
2  kz8udlea   6f91a1e2   mv4uo5xy           mv4uo5xy   lfqlz         Sun          31
3  ze4rr0vi    t6rf8pr   ze4rr0vi   3c83655d34c365c9    None         Mon           1
4  dflewd0u   twidi4l3   twidi4l3           ecur4180   e0ff8        Tues           2
5  ohafu6x5   kz8udlea   ohafu6x5           jfigr5ko   ohafu        Weds           3


然后,您将结果融化:

>>> df3 = pd.melt(df2, id_vars=['day_of_week', 'dayofmonth'], value_name='app_id')
>>> df3
   day_of_week  dayofmonth variable             app_id
0          Fri          29        0           g8m4lecv
1          Sat          30        0           awkcmpns
2          Sun          31        0           kz8udlea
...
28        Tues           2        4              e0ff8
29        Weds           3        4              ohafu


删除variable列:

df3 = df3.drop('variable', axis=1)


然后透视结果:

df4 = df3.pivot_table(index='app_id', columns='day_of_week', aggfunc='count')
>>> df4
                  dayofmonth
day_of_week              Fri Mon Sat Sun Tues Weds
app_id
 3c83655d34c365c9        NaN   1 NaN NaN  NaN  NaN
 6f91a1e2                NaN NaN NaN   1  NaN  NaN
 e0ff8                   NaN NaN NaN NaN    1  NaN
 ecur4180                NaN NaN NaN NaN    1  NaN
...


然后,用零填充NaN值。

df4 = df4.fillna(0)


此时,您还可以从顶部栏中删除dayofmonth

df4.columns = df4.columns.droplevel(0)


瞧!

>>> df4.head()
day_of_week        Fri  Mon  Sat  Sun  Tues  Weds
app_id
 3c83655d34c365c9    0    1    0    0     0     0
 6f91a1e2            0    0    0    1     0     0
 e0ff8               0    0    0    0     1     0
 ecur4180            0    0    0    0     1     0
 jfigr5ko            0    0    0    0     0     1

07-25 20:54