Problem description
I have a pandas data frame. I want to group it by using one combination of columns and count distinct values of another combination of columns.
For example I have the following data frame:
   a   b    c     d      e
0  1  10  100  1000  10000
1  1  10  100  1000  20000
2  1  20  100  1000  20000
3  1  20  100  2000  20000
I can group it by columns a and b and count distinct values in the column d:
df.groupby(['a','b'])['d'].nunique().reset_index()
As a result I get:
   a   b  d
0  1  10  1
1  1  20  2
However, I would like to count distinct values in a combination of columns. For example if I use c and d, then in the first group I have only one unique combination ((100, 1000)) while in the second group I have two distinct combinations: (100, 1000) and (100, 2000).
The following naive "generalization" does not work:
df.groupby(['a','b'])[['c','d']].nunique().reset_index()
because nunique() is not applicable to data frames.
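One formulation that does work, and which also appears in the timings below (In [356]), is to group by all four columns first and then count the resulting groups per (a, b) pair:

# every (a, b, c, d) group is one distinct (c, d) combination within its
# (a, b) pair, so counting those groups per (a, b) gives the answer
print (df.groupby(['a','b','c','d']).size().groupby(level=['a','b']).size())
# a  b
# 1  10    1
#    20    2
# dtype: int64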
You can create a combination of the values, converted to strings, in a new column e and then use SeriesGroupBy.nunique:
df['e'] = df.c.astype(str) + df.d.astype(str)
df = df.groupby(['a','b'])['e'].nunique().reset_index()
print (df)
   a   b  e
0  1  10  1
1  1  20  2
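Note that plain string concatenation can make different combinations collide, e.g. c=1, d=23 and c=12, d=3 both become "123"; inserting a separator, such as df.c.astype(str) + '_' + df.d.astype(str), avoids that ambiguity.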
You can also use a Series without creating a new column:
df = (df.c.astype(str) + df.d.astype(str)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df)
   a   b  f
0  1  10  1
1  1  20  2
Another possible solution is to create tuples:
df = (df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df)
   a   b  f
0  1  10  1
1  1  20  2
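As the timings below show (In [358]), this row-wise apply is by far the slowest option, since it builds a Python tuple for every row before grouping.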
Another NumPy solution, based on this answer:
import numpy as np

def f(x):
    a = x.values
    # view each row as one opaque np.void value so np.unique can
    # count distinct rows of the 2-D array in a single pass
    c = len(np.unique(np.ascontiguousarray(a).view(
        np.dtype((np.void, a.dtype.itemsize * a.shape[1]))),
        return_counts=True)[1])
    return c

print (df.groupby(['a','b'])[['c','d']].apply(f))
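The timings below also exercise a plain-pandas variant of this per-group count, based on duplicated() (In [354]); written out as a standalone solution it would be:

# rows minus duplicated rows = number of distinct (c, d) combinations
print (df.groupby(['a','b'])[['c','d']].apply(lambda g: len(g) - g.duplicated().sum()))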
Timings:
#[1000000 rows x 5 columns]
np.random.seed(123)
N = 1000000
df = pd.DataFrame(np.random.randint(30, size=(N,5)))
df.columns = list('abcde')
print (df)

In [354]: %timeit (df.groupby(['a','b'])[['c','d']].apply(lambda g: len(g) - g.duplicated().sum()))
1 loop, best of 3: 663 ms per loop

In [355]: %timeit (df.groupby(['a','b'])[['c','d']].apply(f))
1 loop, best of 3: 387 ms per loop

In [356]: %timeit (df.groupby(['a', 'b', 'c', 'd']).size().groupby(level=['a', 'b']).size())
1 loop, best of 3: 441 ms per loop

In [357]: %timeit ((df.c.astype(str)+df.d.astype(str)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 4.95 s per loop

In [358]: %timeit ((df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 17.6 s per loop
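For completeness, a vectorized alternative that is not among the original answers (a sketch, assuming df has the columns used above): drop duplicate rows first, then count what survives per pair.

# drop duplicate (a, b, c, d) rows; each surviving row is one distinct
# (c, d) combination for its (a, b) pair, so counting rows per pair works
print (df.drop_duplicates(['a','b','c','d']).groupby(['a','b']).size())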