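I have a df that contains many Places over repeating time periods. These Places start and finish at random. For each time period, I want to assign each unique Place to a Group. The core rules are:

1) Each Group can hold at most 3 unique Places at any one time

2) Unique Places should be distributed evenly across the Groups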

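I've taken a small sample of the df. It has 7 unique values (though no more than 5 occur at any one time) and 2 Groups to choose from. In practice, the df can contain up to 50 unique values in total, starting and finishing at different time periods and spread across up to 6 Groups.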

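To keep track of how many Places are currently occurring, I've included a Total column, which is based on whether a Place appears again.
The df lists, for each Period, every Group available to each unique Place. The Places GOLF and CLUB will finish, but let's assume all the other Places continue, as they appear later in the df.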

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Period' : [1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6],
    'Place' : ['CLUB','CLUB','CLUB','HOME','HOME','AWAY','AWAY','WORK','WORK','AWAY','AWAY','GOLF','GOLF','CLUB','CLUB','POOL','POOL','HOME','HOME','WORK','WORK','AWAY','AWAY','POOL','POOL','TENNIS','TENNIS'],
    'Total' : [1,1,1,2,2,3,3,4,4,4,4,5,5,4,4,4,4,4,4,4,4,4,4,4,4,5,5],
    'Available Group' : ['1','2','1','2','1','2','1','2','1','1','2','1','2','2','1','2','1','2','1','2','1','1','2','1','2','2','1'],
    })
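
As a rough illustration of how the Total column can be derived, the sketch below counts, for each row, the Places whose first and last row positions bracket that row. It assumes one extra padding row per Place that continues beyond the sample (AWAY, HOME, POOL, WORK, TENNIS); that padding and the names `rows`, `span` and `total` are assumptions of this sketch, not part of the original df.

```python
import pandas as pd

places = ['CLUB','CLUB','CLUB','HOME','HOME','AWAY','AWAY','WORK','WORK',
          'AWAY','AWAY','GOLF','GOLF','CLUB','CLUB','POOL','POOL','HOME',
          'HOME','WORK','WORK','AWAY','AWAY','POOL','POOL','TENNIS','TENNIS']
n_real = len(places)

# Assumption: one padding row for each Place that is said to continue.
padded = places + ['AWAY', 'HOME', 'POOL', 'WORK', 'TENNIS']
rows = pd.DataFrame({'Place': padded, 'row': range(len(padded))})

# First and last row position of every Place.
span = rows.groupby('Place')['row'].agg(first='min', last='max')

# A Place counts as "currently occurring" on row r if first <= r <= last.
total = [int(((span['first'] <= r) & (span['last'] >= r)).sum())
         for r in range(n_real)]
print(total)
```

With that padding in place, `total` reproduces the Total column listed above.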

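The main thing giving me trouble is that Places appear and disappear dynamically: they finish at random and new ones start. So assigning and distributing the currently unique Places needs to take this into account.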

Attempt:
def AssignPlace(df):
    uniquePlaces = df['Place'].unique()
    G3 = dict(zip(uniquePlaces, np.arange(len(uniquePlaces)) // 3 + 1))
    df['Assigned Group'] = df['Place'].map(G3)
    return df

df = df.groupby('Available Group', sort=False).apply(AssignPlace)
df = df.drop_duplicates(subset = ['Period','Place'])

Out:
    Period   Place  Total Available Group  Assigned Group
0   1       CLUB    1      1               1
1   2       CLUB    1      2               1
3   2       HOME    2      2               1
5   2       AWAY    3      2               1
7   3       WORK    4      2               2
9   3       AWAY    4      1               1
11  3       GOLF    5      1               2  #GOLF FINISHES SO 4 OCCURRING FROM NEXT ROW
13  4       CLUB    4      2               1  #CLUB FINISHES BUT POOL STARTS SO STILL 4 OCCURRING FROM NEXT ROW
15  4       POOL    4      2               2
17  4       HOME    4      2               1
19  5       WORK    4      2               2
21  5       AWAY    4      1               1
23  5       POOL    4      1               2
25  6       TENNIS  5      2               3  #Signifies issue

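The last row shows where the problem starts. The Assigned Group correctly counts this Place as the 7th unique value, but it doesn't account for the currently unique values. With CLUB and GOLF finished, there are only 5 current unique values and 2 available Groups, yet it returns Group 3. So it keeps counting every new unique value instead of considering only the unique values that are currently occurring.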
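
To make the failure concrete, here is a minimal sketch of the numbering the attempt effectively produces: every Place is numbered once, in order of first appearance, and that number is never released. The names `order` and `static_group` are made up for this illustration.

```python
# Places in order of first appearance in the sample df.
order = ['CLUB', 'HOME', 'AWAY', 'WORK', 'GOLF', 'POOL', 'TENNIS']

# Static numbering: position // 3 + 1, regardless of which Places have finished.
static_group = {place: i // 3 + 1 for i, place in enumerate(order)}

print(static_group['TENNIS'])  # 6 // 3 + 1 = 3
```

TENNIS is the 7th Place ever seen, so it lands in Group 3 even though CLUB and GOLF no longer occupy their positions.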

Expected output, where the Assigned Group for TENNIS is now 1 rather than 3:
    Period   Place  Total Available Group  Assigned Group
0   1       CLUB    1      1               1
1   2       CLUB    1      2               1
3   2       HOME    2      2               1
5   2       AWAY    3      2               1
7   3       WORK    4      2               2
9   3       AWAY    4      1               1
11  3       GOLF    5      1               2
13  4       CLUB    4      2               1
15  4       POOL    4      2               2
17  4       HOME    4      2               1
19  5       WORK    4      2               2
21  5       AWAY    4      1               1
23  5       POOL    4      1               2
25  6       TENNIS  5      2               1
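
One way to sanity-check a candidate assignment against rule 1 is to count, row by row, how many active Places each Group holds. The sketch below does that for the expected output; the helper `max_load_per_group`, the 0-based row spans and the `END` sentinel for Places that continue are all assumptions made for this illustration.

```python
def max_load_per_group(spans, groups):
    """Return {group: largest number of Places active in it on any row}."""
    last_row = max(last for _, last in spans.values())
    peak = {}
    for row in range(last_row + 1):
        load = {}
        for place, (first, last) in spans.items():
            if first <= row <= last:          # Place is active on this row
                g = groups[place]
                load[g] = load.get(g, 0) + 1
        for g, n in load.items():
            peak[g] = max(peak.get(g, 0), n)
    return peak

END = 26  # assumption: last row of the sample, used for Places that continue
# (first row, last row) of each Place, read off the sample df.
spans = {'CLUB': (0, 14), 'HOME': (3, END), 'AWAY': (5, END), 'WORK': (7, END),
         'GOLF': (11, 12), 'POOL': (15, END), 'TENNIS': (25, END)}
expected = {'CLUB': 1, 'HOME': 1, 'AWAY': 1, 'WORK': 2,
            'GOLF': 2, 'POOL': 2, 'TENNIS': 1}

peak = max_load_per_group(spans, expected)
print(peak)
```

Under these assumptions neither Group ever holds more than 3 active Places, so the expected output respects rule 1 while using only 2 Groups.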

Best answer

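Here is my attempt.
The explanation is in the code comments; if that isn't enough, leave me a comment here.

Note: I added 5 dummy rows at the bottom to simulate these Places appearing later in the df, so please ignore the rows with Period=0.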

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Period' : [1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6,0,0,0,0,0],
    'Place' : ['CLUB','CLUB','CLUB','HOME','HOME','AWAY','AWAY','WORK','WORK','AWAY','AWAY','GOLF','GOLF','CLUB','CLUB','POOL','POOL','HOME','HOME','WORK','WORK','AWAY','AWAY','POOL','POOL','TENNIS','TENNIS', "AWAY","HOME","POOL","WORK", "TENNIS"],
#     'Total' : [1,1,1,2,2,3,3,4,4,4,4,5,5,4,4,4,4,4,4,4,4,4,4,4,4,5,5,0,0,0,0,0],
#     'Available Group' : ['1','2','1','2','1','2','1','2','1','1','2','1','2','2','1','2','1','2','1','2','1','1','2','1','2','2','1',0,0,0,0,0],
    })

# df to store all unique places
uniquePlaces = pd.DataFrame(df["Place"].unique(), columns=["Place"])
# Start stores index of df where the place appears 1st
uniquePlaces["Start"] = -1
# End stores index of df where the place appears last
uniquePlaces["End"] = -1

## adds new column "Place Label" which is label encoded value for a place
## "Place Label" may not be necessary but it may improve performance when looking up and merging
## this function also updates Start and End of current label in group
def assign_place_label(group):
    label=uniquePlaces[uniquePlaces["Place"]==group.name].index[0]
    group["Place Label"] = label
    uniquePlaces.loc[label, "Start"] = group.index.min()
    uniquePlaces.loc[label, "End"] = group.index.max()
    return group

## based on Start and End of each place, assign an index to each place
## when a place is freed, its index is reused for a new place appearing after it
def get_dynamic_group(up):
    up["Index"] = 0
    up["Freed"] = False
    max_ind=0
    free_indx = []
    for i in range(len(up)):
        ind_freed = up.index[(up["End"]<up.iloc[i]["Start"]) & (~up["Freed"])]
        free = list(up.loc[ind_freed, "Index"])
        free_indx += free

        up.loc[ind_freed, "Freed"] = True


        if len(free_indx)>0:
            m = min(free_indx)
            up.loc[i, "Index"] = m
            free_indx.remove(m)

        else:
            up.loc[i, "Index"] = max_ind
            max_ind+=1

    up["Group"] = up["Index"]//3+1

    return up

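# group_keys=False keeps the original row index (newer pandas would otherwise add Place to it)
df2 = df.groupby("Place", group_keys=False).apply(assign_place_label)
uniquePlaces = get_dynamic_group(uniquePlaces)

print(uniquePlaces)  # display() is only available inside IPython/Jupyter

df3 = df2[df2.Period!=0].drop_duplicates(subset = ['Period','Place'])
result = df3.merge(uniquePlaces[["Group"]], how="left", left_on="Place Label",
                   right_index=True, sort=False)
print(result)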

Output
    Period  Place   Place Label Group
0   1   CLUB    0   1
1   2   CLUB    0   1
3   2   HOME    1   1
5   2   AWAY    2   1
7   3   WORK    3   2
9   3   AWAY    2   1
11  3   GOLF    4   2
13  4   CLUB    0   1
15  4   POOL    5   2
17  4   HOME    1   1
19  5   WORK    3   2
21  5   AWAY    2   1
23  5   POOL    5   2
25  6   TENNIS  6   1
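
For comparison, here is a compact, pandas-free sketch of the same slot-reuse idea. It is a variant rather than a restatement of the code above: it assumes freed slots are reused in the order they were released (a FIFO queue), whereas the code above picks `min(free_indx)`. The function name `assign_groups` and the hard-coded `spans` (first and last row position of each Place in the padded df) are assumptions for this illustration.

```python
from collections import deque
import heapq

def assign_groups(spans, per_group=3):
    """spans: {place: (first_row, last_row)}. Returns {place: group}."""
    free = deque()   # released slots, oldest release first
    active = []      # min-heap of (last_row, slot) for occupied slots
    next_slot = 0
    groups = {}
    # Visit Places in order of first appearance.
    for place, (first, last) in sorted(spans.items(), key=lambda kv: kv[1][0]):
        # Release every slot whose Place finished before this one starts.
        while active and active[0][0] < first:
            free.append(heapq.heappop(active)[1])
        if free:
            slot = free.popleft()      # reuse a released slot
        else:
            slot = next_slot           # otherwise open a new slot
            next_slot += 1
        heapq.heappush(active, (last, slot))
        groups[place] = slot // per_group + 1
    return groups

spans = {'CLUB': (0, 14), 'HOME': (3, 28), 'AWAY': (5, 27), 'WORK': (7, 30),
         'GOLF': (11, 12), 'POOL': (15, 29), 'TENNIS': (25, 31)}
groups = assign_groups(spans)
print(groups)
```

Because a slot is held by at most one active Place and each Group owns 3 slots, no Group can exceed 3 Places at once. With these spans the sketch puts CLUB, HOME, AWAY and TENNIS in Group 1 and WORK, GOLF and POOL in Group 2, which agrees with the Group column in the table above.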

Regarding "python - Assign values from different options - Pandas", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58021749/
