Problem description
Pandas has a widely used groupby facility that splits a DataFrame according to a corresponding mapping, so that you can apply a calculation to each subgroup and recombine the results.
Can this be done flexibly in NumPy without a native Python for-loop? With a Python loop, it would look like this:
>>> import numpy as np
>>> X = np.arange(10).reshape(5, 2)
>>> groups = np.array([0, 0, 0, 1, 1])
>>> # Split up the elements (rows) of `X` based on their element-wise group
>>> np.array([X[groups==i].sum() for i in np.unique(groups)])
array([15, 30])
Above, 15 is the sum of the first three rows of X, and 30 is the sum of the remaining two.
By "flexibly," I just mean that we aren't focusing on one particular computation such as sum, count, or maximum, but rather passing any computation to the grouped arrays.
If not, is there a faster approach than the above?
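To make "passing any computation to the grouped arrays" concrete, here is a minimal sketch that sorts the rows by group once and splits them at the group boundaries, instead of building a boolean mask per group. The helper name `apply_by_group` is hypothetical, and the sketch still loops over the groups in Python, so it is not a fully vectorized solution.

```python
import numpy as np

def apply_by_group(X, groups, func):
    """Apply `func` to the rows of each group (hypothetical helper)."""
    # A stable sort keeps the original row order inside each group.
    order = np.argsort(groups, kind="stable")
    sorted_groups = groups[order]
    # `starts` holds the position where each distinct label first appears.
    labels, starts = np.unique(sorted_groups, return_index=True)
    # Split the reordered rows at every group boundary except the first.
    chunks = np.split(X[order], starts[1:])
    return labels, [func(chunk) for chunk in chunks]

X = np.arange(10).reshape(5, 2)
groups = np.array([0, 0, 0, 1, 1])
labels, sums = apply_by_group(X, groups, np.sum)
```

Because `func` receives the whole sub-array for a group, it can be any callable, not only a reduction that NumPy provides as a ufunc.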
Recommended answer
If you want a more flexible implementation of groupby that can group using any of numpy's ufuncs:
def groupby_np(X, groups, axis=0, uf=np.add, out=None, minlength=0, identity=None):
    # Make room for every group label that occurs in `groups`.
    if minlength < groups.max() + 1:
        minlength = groups.max() + 1
    if identity is None:
        identity = uf.identity
    # Reduce over every axis except the grouping axis.
    i = list(range(X.ndim))
    del i[axis]
    i = tuple(i)
    n = out is None
    if n:
        if identity is None:  # no identity (e.g. np.maximum): seed each group with a loop
            assert np.all(np.isin(np.arange(minlength), groups)), "No valid identity for unassigned groups"
            # Take index 0 along the non-grouping axes and reduce it per group.
            s = [slice(None)] * X.ndim
            for i_ in i:
                s[i_] = 0
            out = np.array([uf.reduce(X[tuple(s)][groups == g]) for g in range(minlength)])
        else:
            out = np.full((minlength,), identity, dtype=X.dtype)
    # Unbuffered in-place update, so repeated group labels all accumulate.
    uf.at(out, groups, uf.reduce(X, i))
    if n:
        return out
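The final `uf.at` call is what lets the function above accumulate over repeated group labels. As a small illustration (the precomputed `row_sums` array is an assumption standing in for `uf.reduce(X, i)`), compare it with ordinary fancy-index `+=`, which is buffered and does not accumulate repeated indices:

```python
import numpy as np

groups = np.array([0, 0, 0, 1, 1])
row_sums = np.array([1, 5, 9, 13, 17])  # per-row sums of np.arange(10).reshape(5, 2)

# Buffered fancy-index update: repeated indices are not accumulated,
# so each group keeps only a single contribution.
buffered = np.zeros(2, dtype=row_sums.dtype)
buffered[groups] += row_sums

# Unbuffered ufunc.at: every occurrence of an index is applied in turn.
unbuffered = np.zeros(2, dtype=row_sums.dtype)
np.add.at(unbuffered, groups, row_sums)
```

Only the `np.add.at` version reproduces the per-group totals from the question.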
>>> groupby_np(X, groups)
array([15, 30])
>>> groupby_np(X, groups, uf=np.multiply)
array([   0, 3024])
>>> groupby_np(X, groups, uf=np.maximum)
array([5, 9])
>>> groupby_np(X, groups, uf=np.minimum)
array([0, 6])
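A different technique that may be worth comparing is to sort the rows by group and use `ufunc.reduceat` on the contiguous runs. The sketch below is not the answer's method: `groupby_reduceat` is a hypothetical helper, it assumes the groups run along axis 0, and it returns one result per label that actually occurs rather than one per integer up to `groups.max()`.

```python
import numpy as np

def groupby_reduceat(X, groups, uf=np.add):
    """Reduce all elements of each group's rows with `uf` (hypothetical helper)."""
    # Sort rows so that each group occupies one contiguous run.
    order = np.argsort(groups, kind="stable")
    labels, starts = np.unique(groups[order], return_index=True)
    # Collapse each row to a scalar, then reduce each run of rows.
    per_row = uf.reduce(X[order].reshape(len(groups), -1), axis=1)
    return labels, uf.reduceat(per_row, starts)

X = np.arange(10).reshape(5, 2)
groups = np.array([0, 0, 0, 1, 1])
labels, totals = groupby_reduceat(X, groups)
labels, largest = groupby_reduceat(X, groups, uf=np.maximum)
```

Since every run is non-empty, this variant does not need an identity value, which is why it handles `np.maximum` and `np.minimum` without a fallback loop.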