Problem description
I have a pandas data frame. I want to group it by using one combination of columns and count distinct values of another combination of columns.
For example I have the following data frame:
   a   b    c     d      e
0  1  10  100  1000  10000
1  1  10  100  1000  20000
2  1  20  100  1000  20000
3  1  20  100  2000  20000
I can group it by columns a and b and count distinct values in the column d:
df.groupby(['a','b'])['d'].nunique().reset_index()
As a result I get:
   a   b  d
0  1  10  1
1  1  20  2
However, I would like to count distinct values in a combination of columns. For example if I use c and d, then in the first group I have only one unique combination ((100, 1000)) while in the second group I have two distinct combinations: (100, 1000) and (100, 2000).
The following naive "generalization" does not work:
df.groupby(['a','b'])[['c','d']].nunique().reset_index()
because nunique() is not applicable to data frames.
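One formulation that does work, and which also appears in the timings below (In [356]), is to group by all four columns first and then count the resulting groups per (a, b) pair:

# every (a, b, c, d) group is one distinct (c, d) combination within its
# (a, b) pair, so counting those groups per (a, b) gives the answer
print (df.groupby(['a','b','c','d']).size().groupby(level=['a','b']).size())
# a  b
# 1  10    1
#    20    2
# dtype: int64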
You can create a combination of the values, converted to strings, in a new column e and then use SeriesGroupBy.nunique:
df['e'] = df.c.astype(str) + df.d.astype(str)
df = df.groupby(['a','b'])['e'].nunique().reset_index()
print (df)
   a   b  e
0  1  10  1
1  1  20  2
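Note that plain string concatenation can make different combinations collide, e.g. c=1, d=23 and c=12, d=3 both become "123"; inserting a separator, such as df.c.astype(str) + '_' + df.d.astype(str), avoids that ambiguity.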
You can also use a Series without creating a new column:
df = (df.c.astype(str) + df.d.astype(str)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df)
   a   b  f
0  1  10  1
1  1  20  2
Another possible solution is to create tuples:
df = (df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df)
   a   b  f
0  1  10  1
1  1  20  2
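As the timings below show (In [358]), this row-wise apply is by far the slowest option, since it builds a Python tuple for every row before grouping.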
Another NumPy solution, based on this answer:
import numpy as np

def f(x):
    a = x.values
    # view each row as one opaque np.void value so np.unique can
    # count distinct rows of the 2-D array in a single pass
    c = len(np.unique(np.ascontiguousarray(a).view(
        np.dtype((np.void, a.dtype.itemsize * a.shape[1]))),
        return_counts=True)[1])
    return c

print (df.groupby(['a','b'])[['c','d']].apply(f))
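The timings below also exercise a plain-pandas variant of this per-group count, based on duplicated() (In [354]); written out as a standalone solution it would be:

# rows minus duplicated rows = number of distinct (c, d) combinations
print (df.groupby(['a','b'])[['c','d']].apply(lambda g: len(g) - g.duplicated().sum()))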
Timings:
#[1000000 rows x 5 columns]
np.random.seed(123)
N = 1000000
df = pd.DataFrame(np.random.randint(30, size=(N,5)))
df.columns = list('abcde')
print (df)

In [354]: %timeit (df.groupby(['a','b'])[['c','d']].apply(lambda g: len(g) - g.duplicated().sum()))
1 loop, best of 3: 663 ms per loop

In [355]: %timeit (df.groupby(['a','b'])[['c','d']].apply(f))
1 loop, best of 3: 387 ms per loop

In [356]: %timeit (df.groupby(['a', 'b', 'c', 'd']).size().groupby(level=['a', 'b']).size())
1 loop, best of 3: 441 ms per loop

In [357]: %timeit ((df.c.astype(str)+df.d.astype(str)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 4.95 s per loop

In [358]: %timeit ((df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 17.6 s per loop
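For completeness, a vectorized alternative that is not among the original answers (a sketch, assuming df has the columns used above): drop duplicate rows first, then count what survives per pair.

# drop duplicate (a, b, c, d) rows; each surviving row is one distinct
# (c, d) combination for its (a, b) pair, so counting rows per pair works
print (df.drop_duplicates(['a','b','c','d']).groupby(['a','b']).size())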