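I have a dataframe with user and session columns, and I want to randomly sample sessions so that the dataframe contains N unique sessions per user. The order within each session matters, i.e. the "in" column of every kept session must be preserved.
For example, if N=2 and I have: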

        x      in            session_id    user_id
0     0.0     1.0     trn-04a23351-283d       paul
1    -1.0     2.0     trn-04a23351-283d       paul
2    -1.0     3.0     trn-04a23351-283d       paul
3    -1.0     4.0     trn-04a23351-283d       paul
4    -1.0     1.0      blz-412313we-333       paul
5    -1.0     2.0      blz-412313we-333       paul
6     0.0     3.0      blz-412313we-333       paul
7    -1.0     1.0        wha-111111-fff       paul
8     0.0     2.0        wha-111111-fff       paul
9     1.0     1.0         bz-0000-01101      chris
10    0.0     2.0         bz-0000-01101      chris
11   -1.0     1.0       1111-sawas-1221      chris
12   -1.0     2.0       1111-sawas-1221      chris
13    1.0     1.0      pppppppppppppppp      chris
14    1.0     2.0      pppppppppppppppp      chris
15    1.0     3.0      pppppppppppppppp      chris
16   -1.0     1.0     55555555555555555     philip
17   -1.0     2.0     55555555555555555     philip
18   -1.0     3.0     55555555555555555     philip
19   -1.0     1.0       333333333333333     philip
20   -1.0     2.0       333333333333333     philip
21   -1.0     3.0       333333333333333     philip
22    0.0     1.0          zz-222222222     philip
23   -1.0     2.0          zz-222222222     philip
24    0.0     1.0       f-32355261-ss3d      sarah
25   -1.0     2.0       f-32355261-ss3d      sarah
26    0.0     3.0       f-32355261-ss3d      sarah
27    0.0     1.0               adasdfs      sarah
28   -1.0     2.0               adasdfs      sarah
29    0.0     3.0               adasdfs      sarah

I want:
        x      in            session_id    user_id
0     0.0     1.0     trn-04a23351-283d       paul
1    -1.0     2.0     trn-04a23351-283d       paul
2    -1.0     3.0     trn-04a23351-283d       paul
3    -1.0     4.0     trn-04a23351-283d       paul
4    -1.0     1.0      blz-412313we-333       paul
5    -1.0     2.0      blz-412313we-333       paul
6     0.0     3.0      blz-412313we-333       paul
7     1.0     1.0         bz-0000-01101      chris
8     0.0     2.0         bz-0000-01101      chris
9     1.0     1.0      pppppppppppppppp      chris
10    1.0     2.0      pppppppppppppppp      chris
11    1.0     3.0      pppppppppppppppp      chris
12   -1.0     1.0       333333333333333     philip
13   -1.0     2.0       333333333333333     philip
14   -1.0     3.0       333333333333333     philip
15    0.0     1.0          zz-222222222     philip
16   -1.0     2.0          zz-222222222     philip
17    0.0     1.0       f-32355261-ss3d      sarah
18   -1.0     2.0       f-32355261-ss3d      sarah
19    0.0     3.0       f-32355261-ss3d      sarah
20    0.0     1.0               adasdfs      sarah
21   -1.0     2.0               adasdfs      sarah
22    0.0     3.0               adasdfs      sarah

Best answer

Create a reference dataframe to merge with:

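import pandas as pd
# One row per (session, user) pair
d = df[['session_id', 'user_id']].drop_duplicates()
# Randomly pick N=2 sessions for each user
d = d.groupby('user_id', as_index=False).apply(pd.DataFrame.sample, n=2)

# The inner merge keeps only the rows of the sampled sessions
df.merge(d)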

      x   in        session_id user_id
0  -1.0  1.0  blz-412313we-333    paul
1  -1.0  2.0  blz-412313we-333    paul
2   0.0  3.0  blz-412313we-333    paul
3  -1.0  1.0    wha-111111-fff    paul
4   0.0  2.0    wha-111111-fff    paul
5   1.0  1.0     bz-0000-01101   chris
6   0.0  2.0     bz-0000-01101   chris
7  -1.0  1.0   1111-sawas-1221   chris
8  -1.0  2.0   1111-sawas-1221   chris
9  -1.0  1.0   333333333333333  philip
10 -1.0  2.0   333333333333333  philip
11 -1.0  3.0   333333333333333  philip
12  0.0  1.0      zz-222222222  philip
13 -1.0  2.0      zz-222222222  philip
14  0.0  1.0   f-32355261-ss3d   sarah
15 -1.0  2.0   f-32355261-ss3d   sarah
16  0.0  3.0   f-32355261-ss3d   sarah
17  0.0  1.0           adasdfs   sarah
18 -1.0  2.0           adasdfs   sarah
19  0.0  3.0           adasdfs   sarah
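
The same idea can be wrapped in a small function. This is only a sketch, not part of the original answer: the name `sample_sessions`, the `seed` parameter, and the `demo` frame are made up for illustration. It shuffles the unique (user, session) pairs and takes the first `n` per user, so a user with fewer than `n` sessions keeps all of them, whereas `DataFrame.sample(n=2)` raises a `ValueError` when a group has fewer than 2 rows. Passing a seed makes the draw repeatable.

```python
import pandas as pd


def sample_sessions(df, n, seed=None):
    """Keep every row of up to n randomly chosen sessions per user."""
    # One row per (user, session) pair.
    pairs = df[["user_id", "session_id"]].drop_duplicates()
    # Shuffle the pairs, then take the first n of each user.
    # Users with fewer than n sessions simply keep all of them.
    chosen = pairs.sample(frac=1, random_state=seed).groupby("user_id").head(n)
    # The inner merge keeps the row order of df, so the "in" values
    # of each kept session stay in their original sequence.
    return df.merge(chosen, on=["user_id", "session_id"])


# Hypothetical toy data: paul and chris have 3 sessions each, sarah has 1.
demo = pd.DataFrame({
    "user_id": ["paul"] * 5 + ["chris"] * 4 + ["sarah"] * 2,
    "session_id": ["a", "a", "b", "b", "c", "d", "d", "e", "f", "g", "g"],
    "in": [1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2],
})

result = sample_sessions(demo, n=2, seed=0)
print(result)
```

Which sessions are kept depends on the seed, but each user ends up with at most `n` distinct `session_id` values, and the rows of a kept session appear exactly as they do in the input.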

For more on sampling within pandas groups in Python, see the similar question on Stack Overflow: https://stackoverflow.com/questions/50929841/
