I have a DataFrame with 3 columns - location_id, customers, cluster. Previously, I clustered my data into 5 clusters, so the cluster column contains the values [0, 1, 2, 3, 4].
I would like to separate each cluster into 2 slices for my next stage of testing, e.g. a 50-50 slice, a 30-70 slice, or a 20-80 slice.
Question - How do I apply a function that adds a column to data.groupby('cluster')?
Ideal Result
location_id customers cluster slice
0 149213 132817 1 1
1 578371 76655 1 0
2 91703 74048 2 1
3 154868 62397 2 1
4 1022759 59162 2 0
Update
@MaxU's solution put me on the right path. It uses the DataFrame.assign function to add a new column, and compares each row's position within its group against the group's total length to assign slices of the correct proportions. However, the one-liner somehow did not work for me, so I split it into the separate steps below, which did work.
import numpy as np

testgroup = (data.groupby('cluster')
             .apply(lambda x: x.assign(index1=np.arange(len(x)))))
testgroup = (testgroup.groupby('cluster')
             .apply(lambda x: x.assign(total_len=len(x))))
testgroup['is_slice'] = (testgroup['index1'] / testgroup['total_len']) <= 0.5
location_id customers cluster index1 total_len is_slice
0 149213 132817 1 0 12 True
1 578371 76655 1 1 12 True
2 91703 74048 1 2 12 True
3 154868 62397 1 3 12 True
4 1022759 59162 1 4 12 True
5 87016 58134 1 5 12 True
6 649432 56849 1 6 12 False
7 219163 56802 1 7 12 False
8 97704 54718 1 8 12 False
9 248455 52806 1 9 12 False
10 184828 52783 1 10 12 False
11 152887 52565 1 11 12 False
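The two groupby/apply passes above can also be written without apply, using cumcount for the within-group position and transform('size') for the group length. A minimal sketch of this alternative (not from the original answer), on toy data taken from the ideal-result table:

```python
import pandas as pd

# Toy data with the question's columns (values from the ideal-result table)
data = pd.DataFrame({
    'location_id': [149213, 578371, 91703, 154868, 1022759],
    'customers':   [132817, 76655, 74048, 62397, 59162],
    'cluster':     [1, 1, 2, 2, 2],
})

# 0-based position of each row within its cluster (replaces the first apply)
data['index1'] = data.groupby('cluster').cumcount()
# Size of each row's cluster, broadcast to every row (replaces the second apply)
data['total_len'] = data.groupby('cluster')['cluster'].transform('size')
# First ~half of every cluster is flagged, as in the 50-50 split above
data['is_slice'] = data['index1'] / data['total_len'] <= 0.5
```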
Try this:
Let's make your sample DF a bit larger:
In [31]: df = pd.concat([df] * 3, ignore_index=True)
In [32]: df
Out[32]:
location_id customers cluster
0 149213 132817 1
1 578371 76655 1
2 91703 74048 2
3 154868 62397 2
4 1022759 59162 2
5 149213 132817 1
6 578371 76655 1
7 91703 74048 2
8 154868 62397 2
9 1022759 59162 2
10 149213 132817 1
11 578371 76655 1
12 91703 74048 2
13 154868 62397 2
14 1022759 59162 2
slice 30-70:
In [34]: (df.groupby('cluster')
...: .apply(lambda x: x.assign(slice=((np.arange(len(x))/len(x)) <= 0.3).astype(np.uint8)))
...: .reset_index(level=0, drop=True)
...: )
...:
Out[34]:
location_id customers cluster slice
0 149213 132817 1 1
1 578371 76655 1 1
5 149213 132817 1 0
6 578371 76655 1 0
10 149213 132817 1 0
11 578371 76655 1 0
2 91703 74048 2 1
3 154868 62397 2 1
4 1022759 59162 2 1
7 91703 74048 2 0
8 154868 62397 2 0
9 1022759 59162 2 0
12 91703 74048 2 0
13 154868 62397 2 0
14 1022759 59162 2 0
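As a sanity check on the cut point: (np.arange(len(x)) / len(x)) <= frac flags floor(frac * len(x)) + 1 rows per group, which is why 2 of the 6 cluster-1 rows and 3 of the 9 cluster-2 rows end up in the small slice above. A minimal check:

```python
import numpy as np

def flagged_count(n, frac):
    # Number of rows whose relative position arange(n)/n is <= frac,
    # i.e. floor(frac * n) + 1
    return int(((np.arange(n) / n) <= frac).sum())

assert flagged_count(6, 0.3) == 2  # cluster 1 in Out[34]
assert flagged_count(9, 0.3) == 3  # cluster 2 in Out[34]
```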
slice 20-80:
In [35]: (df.groupby('cluster')
...: .apply(lambda x: x.assign(slice=((np.arange(len(x))/len(x)) <= 0.2).astype(np.uint8)))
...: .reset_index(level=0, drop=True)
...: )
...:
Out[35]:
location_id customers cluster slice
0 149213 132817 1 1
1 578371 76655 1 1
5 149213 132817 1 0
6 578371 76655 1 0
10 149213 132817 1 0
11 578371 76655 1 0
2 91703 74048 2 1
3 154868 62397 2 1
4 1022759 59162 2 0
7 91703 74048 2 0
8 154868 62397 2 0
9 1022759 59162 2 0
12 91703 74048 2 0
13 154868 62397 2 0
14 1022759 59162 2 0
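For reuse across the 50-50, 30-70, and 20-80 cases, the answer's arange/len comparison can be wrapped in a small helper. add_slice is a hypothetical name, and cumcount/transform stand in for the apply calls; this is a sketch, not part of the original answer:

```python
import numpy as np
import pandas as pd

def add_slice(df, frac, group_col='cluster', out_col='slice'):
    """Flag roughly the first `frac` share of each group with 1, the rest with 0.

    Hypothetical helper wrapping the answer's technique: within each group,
    rows whose relative position (arange(n)/n) is <= frac get out_col = 1.
    """
    pos = df.groupby(group_col).cumcount()                      # 0-based position in group
    size = df.groupby(group_col)[group_col].transform('size')   # group length per row
    df = df.copy()
    df[out_col] = (pos / size <= frac).astype(np.uint8)
    return df

# Same data as the answer: 5 sample rows repeated 3 times -> 15 rows
df = pd.DataFrame({
    'location_id': [149213, 578371, 91703, 154868, 1022759],
    'customers':   [132817, 76655, 74048, 62397, 59162],
    'cluster':     [1, 1, 2, 2, 2],
})
df = pd.concat([df] * 3, ignore_index=True)

sliced = add_slice(df, 0.3)  # 30-70 split, matching Out[34] above
```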