Let us say I have a pandas dataframe of this type (minimal example):
myDf = pd.DataFrame({'user': ['A', 'B', 'C', 'D', 'E']*2, 'date': ['2017-05-25']*5 + ['2017-05-26']*5, 'nVisits': [10, 2, 3, 0, 0, 6, 0, 4, 8, 1]})
As a table, it looks like this:
date nVisits user
5/25/2017 10 A
5/25/2017 2 B
5/25/2017 3 C
5/25/2017 0 D
5/25/2017 0 E
5/26/2017 6 A
5/26/2017 0 B
5/26/2017 4 C
5/26/2017 8 D
5/26/2017 1 E
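(1) I would like to split my users into 4 buckets per day: 0 visits, 1 visit, 2-4 visits, and 5+ visits, so I want to create a summary dataframe that looks like this: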
date group nVisits nObs
5/25/2017 zero 0 2
5/25/2017 one 0 0
5/25/2017 twoToFour 5 2
5/25/2017 fivePlus 10 1
5/26/2017 zero 0 1
5/26/2017 one 1 1
5/26/2017 twoToFour 4 1
5/26/2017 fivePlus 14 2
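This dataframe is basically the number of observations in each bucket and the total number of visits in each bucket, where the assignment of users to buckets is recomputed every day.
(2) I would also like to classify births and deaths of all customers, where a birth is a customer going from 0 visits to 1 or more visits, and a death is a customer going from 1 or more visits to 0 visits.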
In this specific example the new dataframe would look like this:
date event_type user nVisitsAtBirthDeath
5/26/2017 death B 2
5/26/2017 birth D 8
5/26/2017 birth E 1
Best answer
I would use the pd.cut() method (df below is the question's myDf):
In [29]: df['group'] = pd.cut(df.nVisits,
    ...:                      [-1, 0, 1, 4, np.inf],
    ...:                      labels=['zero', 'one', 'twoToFour', 'fivePlus'])
In [30]: df
Out[30]:
date nVisits user group
0 2017-05-25 10 A fivePlus
1 2017-05-25 2 B twoToFour
2 2017-05-25 3 C twoToFour
3 2017-05-25 0 D zero
4 2017-05-25 0 E zero
5 2017-05-26 6 A fivePlus
6 2017-05-26 0 B zero
7 2017-05-26 4 C twoToFour
8 2017-05-26 8 D fivePlus
9 2017-05-26 1 E one
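
The pd.cut() step above only labels each row. As a rough sketch of how the two tables from the question might then be derived (not part of the original answer), one could group by date and bucket for part (1) and compare each user's visits with their previous row for part (2). The names summary, events and prev are made up for illustration, and the sketch assumes every user has exactly one row per day, so the previous row is the previous day.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['A', 'B', 'C', 'D', 'E'] * 2,
    'date': ['2017-05-25'] * 5 + ['2017-05-26'] * 5,
    'nVisits': [10, 2, 3, 0, 0, 6, 0, 4, 8, 1],
})

# Same binning as in the answer: (-1, 0], (0, 1], (1, 4], (4, inf)
df['group'] = pd.cut(df.nVisits, [-1, 0, 1, 4, np.inf],
                     labels=['zero', 'one', 'twoToFour', 'fivePlus'])

# (1) Per-day, per-bucket totals. observed=False keeps buckets that are
# empty on a given day (e.g. 'one' on 2017-05-25) as rows filled with 0.
grouped = df.groupby(['date', 'group'], observed=False)
summary = pd.DataFrame({
    'nVisits': grouped['nVisits'].sum(),
    'nObs': grouped['user'].count(),
}).reset_index()

# (2) Births and deaths: compare each row with the same user's previous row.
# Assumption: one row per user per day, so "previous row" means "previous day".
df = df.sort_values(['user', 'date'])
df['prev'] = df.groupby('user')['nVisits'].shift()  # NaN on a user's first day

birth = (df['prev'] == 0) & (df['nVisits'] > 0)
death = (df['prev'] > 0) & (df['nVisits'] == 0)

events = df[birth | death].copy()
is_birth = events['nVisits'] > 0
events['event_type'] = np.where(is_birth, 'birth', 'death')
# Births report the new visit count, deaths the last non-zero count.
events['nVisitsAtBirthDeath'] = np.where(
    is_birth, events['nVisits'], events['prev']).astype(int)
events = (events.sort_values(['date', 'user'])
                [['date', 'event_type', 'user', 'nVisitsAtBirthDeath']]
                .reset_index(drop=True))

print(summary)
print(events)
```

Here a birth is taken to mean going from 0 to at least 1 visit, which is what the question's example implies (user E counts as a birth with a single visit). If the data can have missing days, the shift() comparison would need the dates filled in first.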