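I have a dataframe with user and session columns, and I want to randomly sample sessions so that the dataframe contains N unique sessions per user. The order within each session matters, i.e. the "in" column of each session must be preserved.
For example, if N=2
and I have: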
x in session_id user_id
0 0.0 1.0 trn-04a23351-283d paul
1 -1.0 2.0 trn-04a23351-283d paul
2 -1.0 3.0 trn-04a23351-283d paul
3 -1.0 4.0 trn-04a23351-283d paul
4 -1.0 1.0 blz-412313we-333 paul
5 -1.0 2.0 blz-412313we-333 paul
6 0.0 3.0 blz-412313we-333 paul
7 -1.0 1.0 wha-111111-fff paul
8 0.0 2.0 wha-111111-fff paul
9 1.0 1.0 bz-0000-01101 chris
10 0.0 2.0 bz-0000-01101 chris
11 -1.0 1.0 1111-sawas-1221 chris
12 -1.0 2.0 1111-sawas-1221 chris
13 1.0 1.0 pppppppppppppppp chris
14 1.0 2.0 pppppppppppppppp chris
15 1.0 3.0 pppppppppppppppp chris
16 -1.0 1.0 55555555555555555 philip
17 -1.0 2.0 55555555555555555 philip
18 -1.0 3.0 55555555555555555 philip
19 -1.0 1.0 333333333333333 philip
20 -1.0 2.0 333333333333333 philip
21 -1.0 3.0 333333333333333 philip
22 0.0 1.0 zz-222222222 philip
23 -1.0 2.0 zz-222222222 philip
24 0.0 1.0 f-32355261-ss3d sarah
25 -1.0 2.0 f-32355261-ss3d sarah
26 0.0 3.0 f-32355261-ss3d sarah
27 0.0 1.0 adasdfs sarah
28 -1.0 2.0 adasdfs sarah
29 0.0 3.0 adasdfs sarah
I want:
x in session_id user_id
0 0.0 1.0 trn-04a23351-283d paul
1 -1.0 2.0 trn-04a23351-283d paul
2 -1.0 3.0 trn-04a23351-283d paul
3 -1.0 4.0 trn-04a23351-283d paul
4 -1.0 1.0 blz-412313we-333 paul
5 -1.0 2.0 blz-412313we-333 paul
6 0.0 3.0 blz-412313we-333 paul
7 1.0 1.0 bz-0000-01101 chris
8 0.0 2.0 bz-0000-01101 chris
9 1.0 1.0 pppppppppppppppp chris
10 1.0 2.0 pppppppppppppppp chris
11 1.0 3.0 pppppppppppppppp chris
12 -1.0 1.0 333333333333333 philip
13 -1.0 2.0 333333333333333 philip
14 -1.0 3.0 333333333333333 philip
15 0.0 1.0 zz-222222222 philip
16 -1.0 2.0 zz-222222222 philip
17 0.0 1.0 f-32355261-ss3d sarah
18 -1.0 2.0 f-32355261-ss3d sarah
19 0.0 3.0 f-32355261-ss3d sarah
20 0.0 1.0 adasdfs sarah
21 -1.0 2.0 adasdfs sarah
22 0.0 3.0 adasdfs sarah
Best answer
Create a reference dataframe to merge against
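# Unique (session_id, user_id) pairs: one row per session
d = df[['session_id', 'user_id']].drop_duplicates()
# Randomly pick 2 sessions for each user
d = d.groupby('user_id', as_index=False).apply(pd.DataFrame.sample, n=2)
# Inner merge on the shared columns keeps only the rows of the sampled sessions
df.merge(d)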
x in session_id user_id
0 -1.0 1.0 blz-412313we-333 paul
1 -1.0 2.0 blz-412313we-333 paul
2 0.0 3.0 blz-412313we-333 paul
3 -1.0 1.0 wha-111111-fff paul
4 0.0 2.0 wha-111111-fff paul
5 1.0 1.0 bz-0000-01101 chris
6 0.0 2.0 bz-0000-01101 chris
7 -1.0 1.0 1111-sawas-1221 chris
8 -1.0 2.0 1111-sawas-1221 chris
9 -1.0 1.0 333333333333333 philip
10 -1.0 2.0 333333333333333 philip
11 -1.0 3.0 333333333333333 philip
12 0.0 1.0 zz-222222222 philip
13 -1.0 2.0 zz-222222222 philip
14 0.0 1.0 f-32355261-ss3d sarah
15 -1.0 2.0 f-32355261-ss3d sarah
16 0.0 3.0 f-32355261-ss3d sarah
17 0.0 1.0 adasdfs sarah
18 -1.0 2.0 adasdfs sarah
19 0.0 3.0 adasdfs sarah
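The same idea can be written as a small reusable function. The following is only a sketch under a few assumptions: the helper name sample_sessions, its random_state parameter, and the tiny dataset are hypothetical and not part of the original answer. Instead of groupby + apply, it shuffles the unique (user_id, session_id) pairs and keeps the first n per user, then filters the original rows with a boolean mask rather than a merge, so the original row order and index are left untouched.

```python
import pandas as pd

# Small hypothetical dataset with the same columns as in the question.
df = pd.DataFrame({
    "x": [0.0, -1.0, -1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 1.0, 1.0],
    "in": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 1.0, 2.0],
    "session_id": ["s1", "s1", "s2", "s2", "s3", "s3",
                   "s4", "s4", "s5", "s6", "s6"],
    "user_id": ["paul"] * 6 + ["chris"] * 5,
})


def sample_sessions(df, n, random_state=None):
    """Keep every row of up to n randomly chosen sessions per user."""
    # One row per (user, session) pair.
    pairs = df[["user_id", "session_id"]].drop_duplicates()
    # Shuffle all pairs, then take the first n of each user: a random
    # sample without replacement that tolerates users with < n sessions.
    picked = pairs.sample(frac=1, random_state=random_state)
    picked = picked.groupby("user_id").head(n)
    # Boolean mask over the original rows; row order and index are kept.
    keys = pd.MultiIndex.from_frame(df[["user_id", "session_id"]])
    wanted = list(picked.itertuples(index=False, name=None))
    return df[keys.isin(wanted)]


result = sample_sessions(df, n=2, random_state=0)
print(result)
```

Passing a fixed random_state makes the selection repeatable. Users with fewer than n sessions simply keep all of their sessions, whereas sample(n=2) raises an error when a group has fewer than 2 rows.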
For python - sampling within pandas groups, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50929841/