Here is my DataFrame:

session_id  question_difficulty     attempt_updated_at
5c822af21c1fba22            2   1557470128000
5c822af21c1fba22            3   1557469685000
5c822af21c1fba22            4   1557470079000
5c822af21c1fba22            5   1557472999000
5c822af21c1fba22            3   1557474145000
5c822af21c1fba22            3   1557474441000
5c822af21c1fba22            4   1557474299000
5c822af21c1fba22            4   1557474738000
5c822af21c1fba22            3   1557475430000
5c822af21c1fba22            4   1557476960000
5c822af21c1fba22            5   1557477458000
5c822af21c1fba22            2   1557478118000
5c822af21c1fba22            5   1557482556000
5c822af21c1fba22            4   1557482809000
5c822af21c1fba22            5   1557482886000
5c822af21c1fba22            5   1557484232000


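I want to cut the field attempt_updated_at (an epoch timestamp) into two equal bins and find the mean of question_difficulty in each bin, for each session.

I want to store the means of the first and second bins separately.

I tried pd.cut, but I could not figure out how to use it.

I would like my output to look like this, for example: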

session_id         mean1_difficulty       mean2_difficulty
5c822af21c1fba22            5.0                3.0


Any ideas are appreciated.
Thanks.

Best answer

I believe you need qcut, which splits the rows into two groups of equal size, followed by an aggregated mean:

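import pandas as pd

# Bin the timestamps into two quantile-based halves, then average per session and bin.
df1 = (df.groupby(['session_id', pd.qcut(df['attempt_updated_at'], 2, labels=False)])
         ['question_difficulty'].mean()
         .unstack()
         .rename(columns=lambda x: f'mean{x+1}_difficulty'))
print(df1)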
attempt_updated_at  mean1_difficulty  mean2_difficulty
session_id
5c822af21c1fba22                 3.5             4.125


Or use cut, which splits the timestamp range into two intervals of equal width instead:

# Same aggregation, but with two equal-width timestamp intervals.
df1 = (df.groupby(['session_id', pd.cut(df['attempt_updated_at'], 2, labels=False)])
         ['question_difficulty'].mean()
         .unstack()
         .rename(columns=lambda x: f'mean{x+1}_difficulty'))
print(df1)
attempt_updated_at  mean1_difficulty  mean2_difficulty
session_id
5c822af21c1fba22            3.444444          4.285714
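
One caveat: both snippets above compute the bin edges over the whole column, which is fine for the single session shown. If the real data holds several sessions with different time ranges, the bins may need to be computed within each session instead. Here is a minimal sketch of that, assuming per-session binning is what is wanted; the two-session frame and the names half and per_session are made up for illustration.

```python
import pandas as pd

# Made-up data: two sessions whose timestamps do not overlap at all.
df = pd.DataFrame({
    "session_id": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "question_difficulty": [1, 2, 3, 4, 5, 5, 3, 1],
    "attempt_updated_at": [10, 20, 30, 40, 1000, 2000, 3000, 4000],
})

# Bin the timestamps within each session rather than over the whole column.
half = (df.groupby("session_id")["attempt_updated_at"]
          .transform(lambda s: pd.qcut(s, 2, labels=False))
          .astype(int))

# Average the difficulty per session and per half, one column per half.
per_session = (df.groupby(["session_id", half])["question_difficulty"]
                 .mean()
                 .unstack()
                 .rename(columns=lambda x: f"mean{x + 1}_difficulty"))
print(per_session)
```

With a single global qcut on this frame, every row of session a would land in the first bin and every row of session b in the second, so each session would get one mean and one NaN.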


The difference between the two functions is explained in more detail here
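
To see the difference on the data from the question, it may help to count how many rows fall into each bin. This is only a sketch: the frame is rebuilt by hand from the table in the question, and the names q_bins, c_bins, q_counts and c_counts are illustrative.

```python
import pandas as pd

# Data copied from the question (epoch timestamps in milliseconds).
df = pd.DataFrame({
    "session_id": ["5c822af21c1fba22"] * 16,
    "question_difficulty": [2, 3, 4, 5, 3, 3, 4, 4, 3, 4, 5, 2, 5, 4, 5, 5],
    "attempt_updated_at": [
        1557470128000, 1557469685000, 1557470079000, 1557472999000,
        1557474145000, 1557474441000, 1557474299000, 1557474738000,
        1557475430000, 1557476960000, 1557477458000, 1557478118000,
        1557482556000, 1557482809000, 1557482886000, 1557484232000,
    ],
})

# qcut splits at the median, so both bins hold the same number of rows.
q_bins = pd.qcut(df["attempt_updated_at"], 2, labels=False)
# cut splits the min..max range at its midpoint, so bin sizes can differ.
c_bins = pd.cut(df["attempt_updated_at"], 2, labels=False)

q_counts = q_bins.value_counts().sort_index().tolist()
c_counts = c_bins.value_counts().sort_index().tolist()
print(q_counts)  # [8, 8]
print(c_counts)  # [9, 7]
```

The 9/7 split is why the cut means (31/9 and 30/7) differ from the qcut means (28/8 and 33/8).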

08-20 02:38