Problem description
I have a Series of integers with a two-level hierarchical index (a MultiIndex).
>> s
id1  id2
1    a       100
     b        10
     c         9
2    a      2000
3    a         5
     b        10
     c        15
     d        20
...
I want to group by id1 and select the maximum value in each group, while keeping the full index in the result. I have tried the following:
>> s.groupby(level=0).aggregate(np.max)
id1
1 100
2 2000
3 20
But the result is indexed by id1 only. I want my output to look like this:
id1 id2
1 a 100
2 a 2000
3 d 20
A related, but more complicated, question was asked here: "Multiindexed Pandas groupby, ignore a level?" As it states, the answer is kind of a hack.
Does anyone know a better solution? If not, what about the special case where every value of id2 is unique?
Recommended answer
One way to select full rows after a groupby is to use groupby/transform to build a boolean mask, and then use the mask to select the full rows from s:
In [110]: s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]
Out[110]:
id1 id2
1 a 100
2 a 2000
3 d 20
Name: s, dtype: int64
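As a self-contained illustration of the mask approach, here is a minimal sketch that rebuilds the sample Series from the question and compares each value against its group's maximum. Passing the string "max" to transform is a substitution for the lambda above, not the answer's exact code. The mask keeps every row that ties for a group's maximum; the idxmax variant at the end is an additional assumption-level alternative for when exactly one row per group is wanted.

```python
import pandas as pd

# Rebuild the sample Series from the question (values copied from the post).
index = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (1, "c"), (2, "a"),
     (3, "a"), (3, "b"), (3, "c"), (3, "d")],
    names=["id1", "id2"],
)
s = pd.Series([100, 10, 9, 2000, 5, 10, 15, 20], index=index, name="s")

# Broadcast each group's maximum back onto the original (full) index.
group_max = s.groupby(level=0).transform("max")

# Keep the rows whose value equals their group's maximum (all ties are kept).
by_mask = s[s == group_max]

# Alternative: keep one row per group. idxmax returns the (id1, id2) label
# of the first occurrence of each group's maximum.
by_idxmax = s.loc[s.groupby(level=0).idxmax().tolist()]

print(by_mask)
print(by_idxmax)
```

On this sample the two selections contain the same rows because no group has a tied maximum; with ties, the mask returns more rows than idxmax.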
Another way, which is faster in some cases (such as when there are a lot of groups), is to merge the max values m into a DataFrame along with the values in s, and then select rows based on equality between m and s:
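def using_merge(s):
    # Note: relies on the Series being named 's' (see df['s'] below).
    # Per-group maxima, indexed by id1.
    m = s.groupby(level=0).agg(np.max)
    # Move id2 into a column so the frame is indexed by id1 only.
    df = s.reset_index(['id2'])
    # Assignment aligns m to df on id1.
    df['m'] = m
    # Keep the rows whose value equals the group maximum.
    result = df.loc[df['s']==df['m']]
    del result['m']
    # Restore the (id1, id2) MultiIndex.
    result = result.set_index(['id2'], append=True)
    return result['s']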
Here is an example showing that using_merge, while more complicated, may be faster than using_transform:
import numpy as np
import pandas as pd

def using_transform(s):
    return s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]

N = 10**5
id1 = np.random.randint(100, size=N)
id2 = np.random.choice(list('abcd'), size=N)
index = pd.MultiIndex.from_arrays([id1, id2])
ss = pd.Series(np.random.randint(100, size=N), index=index)
ss.index.names = ['id1', 'id2']
ss.name = 's'
Timing these two functions using IPython's %timeit magic yields:
In [121]: %timeit using_merge(ss)
100 loops, best of 3: 12.8 ms per loop
In [122]: %timeit using_transform(ss)
10 loops, best of 3: 45 ms per loop
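Before relying on either timing, it may be worth confirming that the two functions select the same rows. The following is a minimal sketch of such a check on smaller random data. The seed, the sizes, and the use of drop(columns=...) in place of del are choices made here for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

def using_transform(s):
    # Boolean mask: True where a value equals its group's maximum.
    return s[s.groupby(level=0).transform(lambda x: x == x.max()).astype(bool)]

def using_merge(s):
    # Same logic as the answer's function; assumes the Series is named 's'.
    m = s.groupby(level=0).max()
    df = s.reset_index(["id2"])
    df["m"] = m  # aligns on id1
    result = df.loc[df["s"] == df["m"]].drop(columns="m")
    return result.set_index(["id2"], append=True)["s"]

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
N = 1000
id1 = rng.integers(10, size=N)
id2 = rng.choice(list("abcd"), size=N)
ss = pd.Series(
    rng.integers(100, size=N),
    index=pd.MultiIndex.from_arrays([id1, id2], names=["id1", "id2"]),
    name="s",
)

a = using_transform(ss)
b = using_merge(ss)

# Compare values and index labels separately.
same_values = a.tolist() == b.tolist()
same_index = list(a.index) == list(b.index)
print(same_values, same_index)
```

Both functions keep all rows that tie for a group's maximum and preserve the original row order, which is why an element-by-element comparison is meaningful here.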