Python Pandas: Is order preserved when using groupby() and agg()?

Question

I've frequently used pandas' agg() function to run summary statistics on every column of a DataFrame. For example, here's how you would produce the mean and standard deviation:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

>>> df
[output]
        A   B    C
0  group1  10  100
1  group1  12  102
2  group2  10  100
3  group2  25  250
4  group3  10  100
5  group3  12  102
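
The call that produces the mean and standard deviation does not appear above. A minimal sketch of what it presumably looks like, assuming the two statistics are passed to agg() by name (the variable name summary is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

# Mean and sample standard deviation of every non-grouping column, per group.
# The result has a two-level column index: (column name, statistic).
summary = df.groupby('A').agg(['mean', 'std'])
print(summary)
```

Neither statistic depends on the order of the rows inside a group, which is why ordering is not a concern here.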

In both of those cases, the order in which individual rows are sent to the agg function does not matter. But consider the following example:

df.groupby('A').agg([np.mean, lambda x: x.iloc[1] ])

[output]
           B             C
        mean <lambda> mean <lambda>
A
group1  11.0       12  101      102
group2  17.5       25  175      250
group3  11.0       12  101      102

In this case the lambda functions as intended, outputting the second row in each group. However, I have not been able to find anything in the pandas documentation that implies that this is guaranteed to be true in all cases. I want to use agg() along with a weighted average function, so I want to be sure that the rows that come into the function will be in the same order as they appear in the original data frame.

Does anyone know, ideally via somewhere in the docs or pandas source code, whether this is guaranteed to be the case?
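
For the weighted-average use case mentioned in the question, one way to avoid depending on row order at all is to look the weights up by index label inside the aggregation function. This is only a sketch under assumptions: column B is treated as the values and column C as the weights, the helper name weighted_mean_of_b is made up for illustration, and the DataFrame index is assumed to be unique.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})


def weighted_mean_of_b(values):
    # `values` is one group's slice of column B. It keeps the index labels of
    # the original rows, so the matching weights are fetched by label rather
    # than by position.
    weights = df.loc[values.index, 'C']
    return np.average(values.to_numpy(), weights=weights.to_numpy())


weighted = df.groupby('A')['B'].agg(weighted_mean_of_b)
print(weighted)
```

Because values and weights are paired through the index, the result is the same whatever order the rows arrive in.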

Recommended answer

See this enhancement issue.

The short answer is yes, groupby will preserve the ordering of the rows as passed in. You can prove this by using your example like this:

In [20]: df.sort_index(ascending=False).groupby('A').agg([np.mean, lambda x: x.iloc[1] ])
Out[20]:
           B             C
        mean <lambda> mean <lambda>
A
group1  11.0       10  101      100
group2  17.5       10  175      100
group3  11.0       10  101      100
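
Another way to see the same thing is to collect each group's values into a list, which makes the order in which rows reach the aggregation function directly visible. A small sketch (the names reversed_df, rows_seen and rows_seen_original are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

# Reverse the rows first, then group: each group's values should arrive in
# the reversed order, matching the frame that was passed to groupby.
reversed_df = df.sort_index(ascending=False)
rows_seen = reversed_df.groupby('A')['B'].agg(list)
print(rows_seen)

# For comparison, the same aggregation on the frame in its original order.
rows_seen_original = df.groupby('A')['B'].agg(list)
print(rows_seen_original)
```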

This is NOT true for resample, however, as it requires a monotonic index (it WILL work with a non-monotonic index, but will sort it first).
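
The difference can be illustrated with a small series whose timestamps are deliberately out of order. This is a sketch on made-up data; the 5-minute bin width and the variable names are arbitrary choices for the example.

```python
import pandas as pd

# Timestamps deliberately out of order: the earliest one comes last.
stamps = pd.to_datetime(['2020-01-01 00:03', '2020-01-01 00:01',
                         '2020-01-01 00:02', '2020-01-01 00:00'])
s = pd.Series([1, 2, 3, 4], index=stamps)

# resample sorts by timestamp before binning, so "first" means the value
# at the earliest timestamp in the bin.
first_by_resample = s.resample('5min').first()

# A plain groupby on the same 5-minute bins keeps the rows in the order
# they were passed in, so "first" means the first row as stored.
first_by_groupby = s.groupby(s.index.floor('5min')).first()

print(first_by_resample)
print(first_by_groupby)
```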

There is a sort= flag to groupby, but this relates to the sorting of the groups themselves and not the observations within a group.
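
A short sketch of what the sort= flag does and does not change, using a made-up frame whose group keys first appear in non-alphabetical order:

```python
import pandas as pd

# Group 'b' appears before group 'a' in the data.
small = pd.DataFrame({'A': ['b', 'a', 'b', 'a'],
                      'B': [1, 2, 3, 4]})

# sort=True (the default): groups come out in sorted key order.
sorted_groups = small.groupby('A', sort=True)['B'].agg(list)

# sort=False: groups come out in order of first appearance.
unsorted_groups = small.groupby('A', sort=False)['B'].agg(list)

print(sorted_groups)
print(unsorted_groups)
```

In both results the values inside each group keep their original relative order; only the order of the groups differs.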

FYI: df.groupby('A').nth(1) is a safe way to get the 2nd value of each group (your method above will fail if a group has fewer than 2 elements).
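
A minimal sketch of nth(1) on made-up data where one group has only a single row. As far as I know, the index and columns of the result differ between pandas versions, so the example only looks at the selected B values.

```python
import pandas as pd

# 'group3' has a single row, so it has no second element.
short = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3'],
                      'B': [10, 12, 10, 25, 10]})

# nth(1) picks the second row of each group and silently skips groups
# that are too short, instead of raising an error.
second_rows = short.groupby('A').nth(1)
print(second_rows)

second_b_values = second_rows['B'].tolist()
print(second_b_values)
```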

