This article covers how to apply multiple functions at once to a Pandas groupby object.

Problem description

Variations of this question have been asked (see this question), but I haven't found a good solution for what would seem to be a common use case of groupby in Pandas.

Say I have the dataframe lasts and I group by user:

import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})
groupby_obj = lasts.groupby('user')

And I have these functions that I want to apply to groupby_obj (what the functions do isn't important; I made them up. Just know that they require multiple columns from the dataframe):

def custom_func(group):
    return group.running_time.median() - group.num_cores.mean()

def custom_func2(group):
    return max(group.elapsed_time) - min(group.running_time)

I could apply each of these functions separately to the dataframe and then merge the resulting dataframes, but that seems inefficient, is inelegant, and I imagine there has to be a one-line solution.
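
For comparison, the separate-then-combine approach described above might look like the following sketch. The names value_cols, col1, col2 and separate are made up for illustration, and it uses pd.concat rather than a merge, since both results share the same user index. Selecting the value columns explicitly is an assumption on my part: it keeps the grouping column out of each group, which some recent pandas versions otherwise warn about.

```python
import pandas as pd

lasts = pd.DataFrame({
    'user': ['a', 's', 'd', 'd'],
    'elapsed_time': [40000, 50000, 60000, 90000],
    'running_time': [30000, 20000, 30000, 15000],
    'num_cores': [7, 8, 9, 4],
})

def custom_func(group):
    return group.running_time.median() - group.num_cores.mean()

def custom_func2(group):
    return max(group.elapsed_time) - min(group.running_time)

# Hypothetical helper name: the columns the custom functions read.
value_cols = ['elapsed_time', 'running_time', 'num_cores']
grouped = lasts.groupby('user')[value_cols]

# Each apply returns one Series indexed by user; name it so it becomes a column.
col1 = grouped.apply(custom_func).rename('custom_column_1')
col2 = grouped.apply(custom_func2).rename('custom_column_2')

# Both Series share the user index, so a column-wise concat lines them up.
separate = pd.concat([col1, col2], axis=1).reset_index()
print(separate)
```

This works, but it walks over the groups once per function and needs an extra combining step, which is the overhead the question is trying to avoid.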

I haven't really found one, although this blog post (search for "Create a function to get the stats of a group" towards the bottom of the page) suggested wrapping the functions into one function as a dictionary thusly:

def get_stats(group):
    return {'custom_column_1': custom_func(group), 'custom_column_2': custom_func2(group)}

However, when I run groupby_obj.apply(get_stats), instead of separate columns I get a single column of dictionaries:

user
a    {'custom_column_1': 29993.0, 'custom_column_2'...
d    {'custom_column_1': 22493.5, 'custom_column_2'...
s    {'custom_column_1': 19992.0, 'custom_column_2'...
dtype: object

What I would actually like is a single line of code that produces something closer to this dataframe:

user  custom_column_1  custom_column_2
a             29993.0            10000
d             22493.5            75000
s             19992.0            30000

Suggestions on improving this workflow?

Solution

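Modify the get_stats function slightly so that it returns a pd.Series instead of a dict. When the applied function returns a Series, apply uses the Series index as column labels, so each key becomes its own column: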

def get_stats(group):
    return pd.Series({'custom_column_1': custom_func(group),
                      'custom_column_2': custom_func2(group)})

Now you can simply do this:

In [202]: lasts.groupby('user').apply(get_stats).reset_index()
Out[202]:
  user  custom_column_1  custom_column_2
0    a          29993.0          10000.0
1    d          22493.5          75000.0
2    s          19992.0          30000.0
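
Put together as a standalone script, the Series-returning approach might look like this sketch. The value_cols selection and the final astype call are my own additions, not part of the answer: the first keeps the grouping column out of each group (some recent pandas versions warn otherwise), and the second undoes the upcast to float that happens because each Series mixes a float and an integer.

```python
import pandas as pd

lasts = pd.DataFrame({
    'user': ['a', 's', 'd', 'd'],
    'elapsed_time': [40000, 50000, 60000, 90000],
    'running_time': [30000, 20000, 30000, 15000],
    'num_cores': [7, 8, 9, 4],
})

def custom_func(group):
    return group.running_time.median() - group.num_cores.mean()

def custom_func2(group):
    return max(group.elapsed_time) - min(group.running_time)

def get_stats(group):
    # Returning a Series lets apply spread the keys across columns.
    return pd.Series({'custom_column_1': custom_func(group),
                      'custom_column_2': custom_func2(group)})

# Hypothetical helper name: the columns the custom functions read.
value_cols = ['elapsed_time', 'running_time', 'num_cores']
stats = lasts.groupby('user')[value_cols].apply(get_stats).reset_index()

# The Series holds one float and one int, so both columns come back as
# float; cast the second column back to integers to match the desired output.
stats = stats.astype({'custom_column_2': 'int64'})
print(stats)
```

The cast is only safe here because every value in custom_column_2 is a whole number; with missing values or fractional results it would need a different dtype.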


An alternative (slightly uglier) approach that keeps your original, dict-returning get_stats unchanged:

In [188]: pd.DataFrame(lasts.groupby('user')
                            .apply(get_stats).to_dict()) \
            .T \
            .rename_axis('user') \
            .reset_index()
Out[188]:
  user  custom_column_1  custom_column_2
0    a          29993.0          10000.0
1    d          22493.5          75000.0
2    s          19992.0          30000.0
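
If the dict-returning get_stats has to stay as it is, another possible way to expand the column of dictionaries is to hand the list of dicts straight to the DataFrame constructor. This is a different technique from the to_dict/transpose chain above, offered only as a sketch; dict_series and value_cols are names invented for the example.

```python
import pandas as pd

lasts = pd.DataFrame({
    'user': ['a', 's', 'd', 'd'],
    'elapsed_time': [40000, 50000, 60000, 90000],
    'running_time': [30000, 20000, 30000, 15000],
    'num_cores': [7, 8, 9, 4],
})

def custom_func(group):
    return group.running_time.median() - group.num_cores.mean()

def custom_func2(group):
    return max(group.elapsed_time) - min(group.running_time)

def get_stats(group):
    # Original version from the question: returns a plain dict.
    return {'custom_column_1': custom_func(group),
            'custom_column_2': custom_func2(group)}

# Hypothetical helper name: the columns the custom functions read.
value_cols = ['elapsed_time', 'running_time', 'num_cores']

# One dict per user, stored in an object-dtype Series indexed by user.
dict_series = lasts.groupby('user')[value_cols].apply(get_stats)

# A list of dicts becomes one row per dict and one column per key;
# reusing the Series index keeps the user labels attached.
expanded = pd.DataFrame(dict_series.tolist(), index=dict_series.index).reset_index()
print(expanded)
```

Because each column is built from its own list of values here, custom_column_2 should keep an integer dtype instead of being upcast to float.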


08-11 19:40