Question
In pandas.DataFrame.groupby, there is an argument group_keys, which I gather is supposed to control how group keys are included in the dataframe subsets. According to the documentation:
However, I can't find any example where group_keys makes an actual difference:
    import pandas as pd

    df = pd.DataFrame([[0, 1, 3],
                       [3, 1, 1],
                       [3, 0, 0],
                       [2, 3, 3],
                       [2, 1, 0]], columns=list('xyz'))

    gby = df.groupby('x')
    gby_k = df.groupby('x', group_keys=False)
It doesn't make a difference in the output from apply:
    ap = gby.apply(pd.DataFrame.sum)
    #    x  y  z
    # x
    # 0  0  1  3
    # 2  4  4  3
    # 3  6  1  1

    ap_k = gby_k.apply(pd.DataFrame.sum)
    #    x  y  z
    # x
    # 0  0  1  3
    # 2  4  4  3
    # 3  6  1  1
And even if you print out the grouped subsets as you go, the results are still identical:
    def printer_func(x):
        print(x)
        return x

    print('gby')
    print('--------------')
    gby.apply(printer_func)
    print('--------------')
    print('gby_k')
    print('--------------')
    gby_k.apply(printer_func)
    print('--------------')

    # gby
    # --------------
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 3  2  3  3
    # 4  2  1  0
    #    x  y  z
    # 1  3  1  1
    # 2  3  0  0
    # --------------
    # gby_k
    # --------------
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 3  2  3  3
    # 4  2  1  0
    #    x  y  z
    # 1  3  1  1
    # 2  3  0  0
    # --------------
I considered the possibility that the default argument is actually True, but setting group_keys explicitly to False doesn't make a difference either. What exactly is this argument for?
(Running on pandas version 0.18.1)
Edit: I did find a case where group_keys changes behavior, based on this answer:
    import pandas as pd
    import numpy as np

    row_idx = pd.MultiIndex.from_product(((0, 1), (2, 3, 4)))
    d = pd.DataFrame([[4, 3], [1, 3], [1, 1], [2, 4], [0, 1], [4, 2]], index=row_idx)

    df_n = d.groupby(level=0).apply(lambda x: x.nlargest(2, [0]))
    #        0  1
    # 0 0 2  4  3
    #     3  1  3
    # 1 1 4  4  2
    #     2  2  4

    df_k = d.groupby(level=0, group_keys=False).apply(lambda x: x.nlargest(2, [0]))
    #      0  1
    # 0 2  4  3
    #   3  1  3
    # 1 4  4  2
    #   2  2  4
However, I'm still not clear on the intelligible principle behind what group_keys is supposed to do. This behavior does not seem intuitive based on @piRSquared's answer.
Recommended answer
The group_keys parameter of groupby comes in handy during apply operations. With group_keys=True, the result gains an additional index level corresponding to the grouped columns; with group_keys=False, that level is left out. The difference shows up especially when the applied function operates on individual columns.
One such instance:
    In [21]: gby = df.groupby('x', group_keys=True).apply(lambda row: row['x'])

    In [22]: gby
    Out[22]:
    x
    0  0    0
    2  3    2
       4    2
    3  1    3
       2    3
    Name: x, dtype: int64

    In [23]: gby_k = df.groupby('x', group_keys=False).apply(lambda row: row['x'])

    In [24]: gby_k
    Out[24]:
    0    0
    3    2
    4    2
    1    3
    2    3
    Name: x, dtype: int64
One of its intended applications could be to produce a MultiIndex object, so that you can then group by one of the levels of the hierarchy.
    In [27]: gby.groupby(level='x').sum()
    Out[27]:
    x
    0    0
    2    4
    3    6
    Name: x, dtype: int64
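To see the index difference in a self-contained script, here is a minimal sketch using the question's df. It is an illustration and not part of the original answer: the helper top_row is a hypothetical name, and the ['y', 'z'] column selection is my own choice so that the grouping column is not passed to the applied function. The function returns fewer rows than each group contains, which is the kind of case where group_keys should be visible in the result's index.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 3], [3, 1, 1], [3, 0, 0], [2, 3, 3], [2, 1, 0]],
    columns=list("xyz"),
)


def top_row(group):
    # Hypothetical helper: keep the single row with the largest y in each group.
    return group.nlargest(1, "y")


# Selecting ['y', 'z'] keeps the grouping column out of the frames passed to apply.
with_keys = df.groupby("x", group_keys=True)[["y", "z"]].apply(top_row)
without_keys = df.groupby("x", group_keys=False)[["y", "z"]].apply(top_row)

print(with_keys.index.nlevels)     # 2: the group key x plus the original row label
print(without_keys.index.nlevels)  # 1: only the original row label
print(list(without_keys.index))    # [0, 3, 1]
```

The values in both results are the same; only the index differs, so with_keys can be grouped again by level 'x' while without_keys lines up directly with the rows of df.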