I have a DataFrame with 3 columns - location_id, customers, cluster. Previously, I clustered my data into 5 clusters, so the cluster column contains the values [0, 1, 2, 3, 4].
I would like to separate each cluster into 2 slices for my next stage of testing, e.g. a 50-50 slice, a 30-70 slice, or a 20-80 slice.
Question - How do I apply a function that adds a column to data.groupby('cluster')?
Ideal Result
location_id customers cluster slice
0 149213 132817 1 1
1 578371 76655 1 0
2 91703 74048 2 1
3 154868 62397 2 1
4 1022759 59162 2 0
Update
@MaxU's solution put me on the right path. It uses the DataFrame.assign function to add a new column, and compares each row's position within its group against the group's total length to assign slices of the correct proportions. However, the one-liner somehow did not work for me, so I split it into the separate steps below, which did work.
import numpy as np

testgroup = (data.groupby('cluster')
             .apply(lambda x: x.assign(index1=np.arange(len(x)))))
testgroup = (testgroup.groupby('cluster')
             .apply(lambda x: x.assign(total_len=len(x))))
testgroup['is_slice'] = (testgroup['index1'] / testgroup['total_len']) <= 0.5
location_id customers cluster index1 total_len is_slice
0 149213 132817 1 0 12 True
1 578371 76655 1 1 12 True
2 91703 74048 1 2 12 True
3 154868 62397 1 3 12 True
4 1022759 59162 1 4 12 True
5 87016 58134 1 5 12 True
6 649432 56849 1 6 12 False
7 219163 56802 1 7 12 False
8 97704 54718 1 8 12 False
9 248455 52806 1 9 12 False
10 184828 52783 1 10 12 False
11 152887 52565 1 11 12 False
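The two groupby/apply passes above can also be written without apply, using cumcount for the within-group position and transform('size') for the group length. A minimal sketch of this alternative (not from the original answer), on toy data taken from the ideal-result table:

```python
import pandas as pd

# Toy data with the question's columns (values from the ideal-result table)
data = pd.DataFrame({
    'location_id': [149213, 578371, 91703, 154868, 1022759],
    'customers':   [132817, 76655, 74048, 62397, 59162],
    'cluster':     [1, 1, 2, 2, 2],
})

# 0-based position of each row within its cluster (replaces the first apply)
data['index1'] = data.groupby('cluster').cumcount()
# Size of each row's cluster, broadcast to every row (replaces the second apply)
data['total_len'] = data.groupby('cluster')['cluster'].transform('size')
# First ~half of every cluster is flagged, as in the 50-50 split above
data['is_slice'] = data['index1'] / data['total_len'] <= 0.5
```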
Try this:
Let's make your sample DF a bit larger:
In [31]: df = pd.concat([df] * 3, ignore_index=True)
In [32]: df
Out[32]:
location_id customers cluster
0 149213 132817 1
1 578371 76655 1
2 91703 74048 2
3 154868 62397 2
4 1022759 59162 2
5 149213 132817 1
6 578371 76655 1
7 91703 74048 2
8 154868 62397 2
9 1022759 59162 2
10 149213 132817 1
11 578371 76655 1
12 91703 74048 2
13 154868 62397 2
14 1022759 59162 2
slice 30-70:
In [34]: (df.groupby('cluster')
...: .apply(lambda x: x.assign(slice=((np.arange(len(x))/len(x)) <= 0.3).astype(np.uint8)))
...: .reset_index(level=0, drop=True)
...: )
...:
Out[34]:
location_id customers cluster slice
0 149213 132817 1 1
1 578371 76655 1 1
5 149213 132817 1 0
6 578371 76655 1 0
10 149213 132817 1 0
11 578371 76655 1 0
2 91703 74048 2 1
3 154868 62397 2 1
4 1022759 59162 2 1
7 91703 74048 2 0
8 154868 62397 2 0
9 1022759 59162 2 0
12 91703 74048 2 0
13 154868 62397 2 0
14 1022759 59162 2 0
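As a sanity check on the cut point: (np.arange(len(x)) / len(x)) <= frac flags floor(frac * len(x)) + 1 rows per group, which is why 2 of the 6 cluster-1 rows and 3 of the 9 cluster-2 rows end up in the small slice above. A minimal check:

```python
import numpy as np

def flagged_count(n, frac):
    # Number of rows whose relative position arange(n)/n is <= frac,
    # i.e. floor(frac * n) + 1
    return int(((np.arange(n) / n) <= frac).sum())

assert flagged_count(6, 0.3) == 2  # cluster 1 in Out[34]
assert flagged_count(9, 0.3) == 3  # cluster 2 in Out[34]
```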
slice 20-80:
In [35]: (df.groupby('cluster')
...: .apply(lambda x: x.assign(slice=((np.arange(len(x))/len(x)) <= 0.2).astype(np.uint8)))
...: .reset_index(level=0, drop=True)
...: )
...:
Out[35]:
location_id customers cluster slice
0 149213 132817 1 1
1 578371 76655 1 1
5 149213 132817 1 0
6 578371 76655 1 0
10 149213 132817 1 0
11 578371 76655 1 0
2 91703 74048 2 1
3 154868 62397 2 1
4 1022759 59162 2 0
7 91703 74048 2 0
8 154868 62397 2 0
9 1022759 59162 2 0
12 91703 74048 2 0
13 154868 62397 2 0
14 1022759 59162 2 0
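For reuse across the 50-50, 30-70, and 20-80 cases, the answer's arange/len comparison can be wrapped in a small helper. add_slice is a hypothetical name, and cumcount/transform stand in for the apply calls; this is a sketch, not part of the original answer:

```python
import numpy as np
import pandas as pd

def add_slice(df, frac, group_col='cluster', out_col='slice'):
    """Flag roughly the first `frac` share of each group with 1, the rest with 0.

    Hypothetical helper wrapping the answer's technique: within each group,
    rows whose relative position (arange(n)/n) is <= frac get out_col = 1.
    """
    pos = df.groupby(group_col).cumcount()                      # 0-based position in group
    size = df.groupby(group_col)[group_col].transform('size')   # group length per row
    df = df.copy()
    df[out_col] = (pos / size <= frac).astype(np.uint8)
    return df

# Same data as the answer: 5 sample rows repeated 3 times -> 15 rows
df = pd.DataFrame({
    'location_id': [149213, 578371, 91703, 154868, 1022759],
    'customers':   [132817, 76655, 74048, 62397, 59162],
    'cluster':     [1, 1, 2, 2, 2],
})
df = pd.concat([df] * 3, ignore_index=True)

sliced = add_slice(df, 0.3)  # 30-70 split, matching Out[34] above
```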