Problem description
I have the following pandas dataframe (it is just a small extract):
GROUP AVG_PERCENT_EVAL_1 AVG_PERCENT_NEGATIVE AVG_TOTAL_WAIT_TIME AVG_TOTAL_SERVICE_TIME
AAAAA 19 11.000000 25.000000 163.000000
AAAAA 22 2.000000 146.364198 332.761317
AAAAA 23 0.500000 44.068225 302.708639
AAAAA 24 1.000000 122.672215 322.359795
AAAAA 26 1.000000 143.594896 317.940989
BBBBB 18 1.000000 121.225692 319.292226
BBBBB 19 1.000000 40.054707 201.096152
BBBBB 21 0.333333 29.221458 207.142059
BBBBB 27 2.000000 103.796290 313.685358
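To make the snippets further down easy to run, here is a minimal sketch that rebuilds this extract as a DataFrame (the values are copied from the table above, and the variable name df matches the one used in the answer):
import numpy as np
import pandas as pd

# Sample extract from the question, rebuilt by hand so the later snippets can be run as-is
df = pd.DataFrame({
    "GROUP": ["AAAAA"] * 5 + ["BBBBB"] * 4,
    "AVG_PERCENT_EVAL_1": [19, 22, 23, 24, 26, 18, 19, 21, 27],
    "AVG_PERCENT_NEGATIVE": [11.0, 2.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.333333, 2.0],
    "AVG_TOTAL_WAIT_TIME": [25.0, 146.364198, 44.068225, 122.672215, 143.594896,
                            121.225692, 40.054707, 29.221458, 103.796290],
    "AVG_TOTAL_SERVICE_TIME": [163.0, 332.761317, 302.708639, 322.359795, 317.940989,
                               319.292226, 201.096152, 207.142059, 313.685358],
})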
I need to make AVG_PERCENT_EVAL_1 more continuous, which means that, instead of exact values such as 18, 19, 20, etc., I want to use ranges such as 18-20, 21-23, and so on up to around 40.
The data should be grouped by GROUP, and then AVG_PERCENT_NEGATIVE, AVG_TOTAL_WAIT_TIME and AVG_TOTAL_SERVICE_TIME should be averaged for each corresponding range.
IMPORTANT: Let's take the range 18-20. The group AAAAA has one corresponding entry, with AVG_PERCENT_EVAL_1 equal to 19, while the group BBBBB has two entries falling in this range, 18 and 19. The values of AVG_PERCENT_NEGATIVE, AVG_TOTAL_WAIT_TIME and AVG_TOTAL_SERVICE_TIME should therefore be:
GROUP AVG_PERCENT_RANGE AVG_PERCENT_NEGATIVE AVG_TOTAL_WAIT_TIME AVG_TOTAL_SERVICE_TIME
AAAAA 18-20 11.00 25.00 163.000000
BBBBB 18-20 1.00 80.64 260.19
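To make the expected BBBBB row concrete, those numbers are simply the plain mean of the two rows that fall inside 18-20 (the ones at 18 and 19); there is no row at 20, so nothing else enters the average:
(1.000000 + 1.000000) / 2        # AVG_PERCENT_NEGATIVE    -> 1.00
(121.225692 + 40.054707) / 2     # AVG_TOTAL_WAIT_TIME     -> 80.64
(319.292226 + 201.096152) / 2    # AVG_TOTAL_SERVICE_TIME  -> 260.19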
I know how to group data by concrete columns and then calculate 'mean' or 'count' using agg. However, in this case I don't know how to create the ranges for AVG_PERCENT_EVAL_1. Also, I don't know how to specify that missing entries should not be treated as 0. For instance, in the example above there is no row for AAAAA with AVG_PERCENT_EVAL_1 equal to 18 or 20, so I just want to take the values for 19 without averaging them with 0 values for 18 and 20.
Steps:
1) Bin AVG_PERCENT_EVAL_1 into appropriate labels by passing a bin sequence to pd.cut(). Specifying include_lowest=True takes care of the inclusiveness of the left endpoint "[", whereas right=False makes the right endpoint an open interval ")" (the raw intervals this produces are sketched right after this list).
2) Using the returned categories, re-label them as per desired requirements.
3) Perform a groupby with GROUP and the newly computed bin ranges as the grouping keys, and aggregate the mean of every remaining column after dropping AVG_PERCENT_EVAL_1.
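As a quick sanity check of steps 1 and 2 (a sketch, assuming the same 18-to-42 bin edges used in the code below), the raw categories returned by pd.cut are left-closed, right-open intervals, which the relabelling step then turns into the "18-20"-style strings:
raw = pd.cut(df.AVG_PERCENT_EVAL_1, bins=np.arange(18, 43, 3), include_lowest=True, right=False)
print(raw.cat.categories)  # roughly: [18, 21), [21, 24), ..., [39, 42), all closed on the left
The exact representation depends on the pandas version: recent versions return an IntervalIndex of Interval objects, while older versions returned plain string labels such as '[18, 21)', which is what the character slicing in the original answer relied on.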
binning portion:
step = 3
kwargs = dict(include_lowest=True, right=False)  # bins are left-closed, right-open: [18, 21), [21, 24), ...
bins = pd.cut(df.AVG_PERCENT_EVAL_1, bins=np.arange(18, 40 + step, step), **kwargs)
# Relabel each interval [a, b) as "a-(b-1)", e.g. [18, 21) becomes "18-20"
labels = [f"{int(cat.left)}-{int(cat.right) - 1}" for cat in bins.cat.categories]
bins = bins.cat.rename_categories(labels)  # rename_categories replaces the old in-place assignment to .cat.categories
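After the renaming, bins.cat.categories should read '18-20', '21-23', ..., '39-41'. As a design note, the same labels can also be passed straight to pd.cut through its labels argument, which avoids touching the interval objects entirely; a minimal sketch with the same bin edges:
edges = np.arange(18, 40 + step, step)
range_labels = [f"{lo}-{hi - 1}" for lo, hi in zip(edges[:-1], edges[1:])]  # ['18-20', '21-23', ..., '39-41']
bins = pd.cut(df.AVG_PERCENT_EVAL_1, bins=edges, right=False, include_lowest=True, labels=range_labels)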
assign and groupby.agg():
df = df.assign(AVG_PERCENT_RANGE=bins).drop("AVG_PERCENT_EVAL_1", axis=1)
# observed=True keeps only the bins that actually occur in each GROUP (no rows for empty ranges)
df.groupby(['GROUP', 'AVG_PERCENT_RANGE'], as_index=False, observed=True).agg('mean')
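For the extract above, that final call should come out roughly like this (values rounded to two decimals; note that the two 18-20 rows match the expected output shown in the question):
GROUP AVG_PERCENT_RANGE AVG_PERCENT_NEGATIVE AVG_TOTAL_WAIT_TIME AVG_TOTAL_SERVICE_TIME
AAAAA 18-20 11.00  25.00 163.00
AAAAA 21-23  1.25  95.22 317.73
AAAAA 24-26  1.00 133.13 320.15
BBBBB 18-20  1.00  80.64 260.19
BBBBB 21-23  0.33  29.22 207.14
BBBBB 27-29  2.00 103.80 313.69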