python - Is filtering a Pandas DataFrame slow for a "large" number of groups?

I have a DataFrame with about 200k rows that I am trying to filter as follows:

>>> df.groupby(key).filter(lambda group: len(group) > 100)

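where key is a list of columns. When the specified key splits the DataFrame into around 800 groups, this runs in 3 seconds. However, if I add another column to the key, which raises the number of groups to around 2500, execution eats up all of my memory and basically crashes the system unless I kill the script.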
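Before filtering, it can help to check how many groups a given key actually produces and how large they are. The following is a minimal sketch, not the original data: the column names `k1`, `k2`, `value` and the random values are assumptions chosen to give roughly 2500 groups over 200k rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: two key columns, one value column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "k1": rng.integers(0, 50, size=200_000),
    "k2": rng.integers(0, 50, size=200_000),
    "value": rng.standard_normal(200_000),
})
key = ["k1", "k2"]  # assumed key columns

grouped = df.groupby(key)
n_groups = grouped.ngroups           # number of distinct key combinations present
sizes = grouped.size()               # rows per group, indexed by the key values
n_large = int((sizes > 100).sum())   # groups that would survive the filter

print(n_groups, n_large)
```

If `n_groups` is far larger than expected, the key itself is worth inspecting before blaming `filter`.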

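I can do the same thing by iterating over the groups, but it is clunky compared to the one-liner above, which makes me wonder why the filter function is so limited.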
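For reference, the loop-based version might look roughly like the sketch below. The data, the column names `k1`, `k2`, `value`, and the variable names are assumptions for illustration only; the threshold of 100 comes from the one-liner above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data (about 2500 groups over 200k rows).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "k1": rng.integers(0, 50, size=200_000),
    "k2": rng.integers(0, 50, size=200_000),
    "value": rng.standard_normal(200_000),
})
key = ["k1", "k2"]  # assumed key columns

# Walk over the groups and keep only those with more than 100 rows.
kept = [group for _, group in df.groupby(key) if len(group) > 100]

# Stitch the surviving groups back together in the original row order.
looped = pd.concat(kept).sort_index() if kept else df.iloc[0:0]

# The one-liner, for comparison.
filtered = df.groupby(key).filter(lambda group: len(group) > 100)
```

On this synthetic data the two results contain the same rows in the same order.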

Can someone explain to me whether this is to be expected, and if so, why?

Thanks!

Best answer

This depends somewhat on the number of groups, but something else must be going on in your case. This is pretty fast:

In [9]: import numpy as np; from pandas import DataFrame

In [10]: N = 1000000

In [11]: ngroups = 1000

In [12]: df = DataFrame(dict(A = np.random.randint(0,ngroups,size=N),B=np.random.randn(N)))

In [13]: %timeit df.groupby('A').filter(lambda x: len(x) > 1000)
1 loops, best of 3: 431 ms per loop

In [14]: df.groupby('A').filter(lambda x: len(x) > 1000).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 508918 entries, 0 to 999997
Data columns (total 2 columns):
A    508918 non-null int64
B    508918 non-null float64
dtypes: float64(1), int64(1)

In [15]: df = DataFrame(dict(A = np.random.randint(0,10,size=N),B=np.random.randn(N)))

In [16]: %timeit df.groupby('A').filter(lambda x: len(x) > 1000)
1 loops, best of 3: 182 ms per loop

In [17]: df.groupby('A').filter(lambda x: len(x) > 1000).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A    1000000 non-null int64
B    1000000 non-null float64
dtypes: float64(1), int64(1)
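If `filter` with a Python lambda is still a problem for a multi-column key, one alternative worth trying is a boolean mask built from `transform("size")`, which avoids calling a Python function once per group. This is a sketch under assumptions, not something timed in the session above: the two key columns `k1` and `k2`, the synthetic data, and the variable names are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like the question: 200k rows, about 2500 groups.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "k1": rng.integers(0, 50, size=200_000),
    "k2": rng.integers(0, 50, size=200_000),
    "B": rng.standard_normal(200_000),
})
key = ["k1", "k2"]  # assumed key columns

# For every row, the size of the group that row belongs to (aligned to df.index).
group_sizes = df.groupby(key)["B"].transform("size")

# Keep the rows whose group has more than 100 members.
masked = df[group_sizes > 100]

# Same selection written with filter, for comparison.
filtered = df.groupby(key).filter(lambda x: len(x) > 100)
```

Both approaches select the same rows here; the mask version only needs the group sizes, so it does not build a separate sub-frame for each group.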