Apply multiple functions at once to a Pandas groupby object

Problem description

Variations of this question have been asked before, but I haven't found a good solution for what would seem to be a common use case of groupby in Pandas.

Say I have the dataframe lasts and I group by user:

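    import pandas as pd

    lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                          'elapsed_time': [40000, 50000, 60000, 90000],
                          'running_time': [30000, 20000, 30000, 15000],
                          'num_cores': [7, 8, 9, 4]})
    groupby_obj = lasts.groupby('user')  # the groupby object referred to below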

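And I have these functions that I want to apply to groupby_obj (what the functions do isn't important and I made them up; just know that they require multiple columns from the dataframe):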

    def custom_func(group):
        return group.running_time.median() - group.num_cores.mean()

    def custom_func2(group):
        return max(group.elapsed_time) - min(group.running_time)

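I could apply each of these functions separately to the dataframe and then merge the resulting dataframes, but that seems inefficient and inelegant, and I imagine there has to be a one-line solution.

I haven't really found one, although this blog post (search for "Create a function to get the stats of a group" towards the bottom of the page) suggested wrapping the functions into a single function that returns a dictionary: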

    def get_stats(group):
        return {'custom_column_1': custom_func(group),
                'custom_column_2': custom_func2(group)}

However, when I run groupby_obj.apply(get_stats), instead of columns I get a single column of dictionary results:

    user
    a    {'custom_column_1': 29993.0, 'custom_column_2'...
    d    {'custom_column_1': 22493.5, 'custom_column_2'...
    s    {'custom_column_1': 19992.0, 'custom_column_2'...
    dtype: object

What I would actually like is a single line of code that produces something closer to this dataframe:

    user  custom_column_1  custom_column_2
    a             29993.0            10000
    d             22493.5            75000
    s             19992.0            30000

Any suggestions on improving this workflow?

Solution

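If you slightly modify the get_stats function so that it returns a pd.Series instead of a dictionary:

    def get_stats(group):
        return pd.Series({'custom_column_1': custom_func(group),
                          'custom_column_2': custom_func2(group)})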

Now you can simply do this:

    In [202]: lasts.groupby('user').apply(get_stats).reset_index()
    Out[202]:
      user  custom_column_1  custom_column_2
    0    a          29993.0          10000.0
    1    d          22493.5          75000.0
    2    s          19992.0          30000.0
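For anyone who wants to paste and run the whole thing, here is a minimal end-to-end sketch of the Series-returning approach. Two things in it are my own additions rather than part of the answer: the value_cols list, which selects the non-grouping columns before apply (some newer pandas releases warn about, or change, passing the grouping column into the applied function), and the final astype cast. The cast is there because a Series that holds a float and an int is upcast to float64, which is why custom_column_2 prints as 10000.0 above rather than 10000.

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})


def custom_func(group):
    return group.running_time.median() - group.num_cores.mean()


def custom_func2(group):
    return max(group.elapsed_time) - min(group.running_time)


def get_stats(group):
    # Returning a Series makes apply() turn each key into a column.
    return pd.Series({'custom_column_1': custom_func(group),
                      'custom_column_2': custom_func2(group)})


# Assumption: select the value columns explicitly so the grouping column
# is not part of the frames handed to get_stats.
value_cols = ['elapsed_time', 'running_time', 'num_cores']

stats = lasts.groupby('user')[value_cols].apply(get_stats).reset_index()

# Optional: restore the integer dtype that the mixed Series upcast to float.
stats = stats.astype({'custom_column_2': 'int64'})

print(stats)
```

If the float column is fine for your purposes, the astype line can simply be dropped.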

Alternative (a bit ugly) approach that uses your functions unchanged, including the dictionary-returning get_stats:

    In [188]: pd.DataFrame(lasts.groupby('user')
                                .apply(get_stats).to_dict()) \
                .T \
                .rename_axis('user') \
                .reset_index()
    Out[188]:
      user  custom_column_1  custom_column_2
    0    a          29993.0          10000.0
    1    d          22493.5          75000.0
    2    s          19992.0          30000.0
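If you do want to keep the dictionary-returning get_stats, another way to expand the resulting column of dicts is to hand the list of dicts straight to the DataFrame constructor and reuse the group index. This is a sketch of my own, not something from the answer above; the names dict_series, expanded, and value_cols are made up for illustration. As a side effect, each column gets its own dtype here, so custom_column_2 should stay integer.

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})


def custom_func(group):
    return group.running_time.median() - group.num_cores.mean()


def custom_func2(group):
    return max(group.elapsed_time) - min(group.running_time)


def get_stats(group):
    # Unchanged from the question: returns a plain dict per group.
    return {'custom_column_1': custom_func(group),
            'custom_column_2': custom_func2(group)}


# Assumption: select the value columns so the grouping column is not
# passed into get_stats.
value_cols = ['elapsed_time', 'running_time', 'num_cores']

# One dict per user, stored in an object-dtype Series indexed by user.
dict_series = lasts.groupby('user')[value_cols].apply(get_stats)

# Build one row per dict, keep the 'user' index, then make it a column.
expanded = pd.DataFrame(dict_series.tolist(),
                        index=dict_series.index).reset_index()

print(expanded)
```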

