Problem description
I have a problem that is similar to this question, but just different enough that it can't be solved with the same solution.
I've got two dataframes, df1 and df2, like this:
import pandas as pd
import numpy as np
np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a':np.random.choice(names, 20), 'ID_b':np.random.choice(names,20)})
df2 = pd.DataFrame({'ID':names})
>>> df1
ID_a ID_b
0 joe ben
1 ben jack
2 jane joe
3 ben jill
4 ben beatrice
5 jill ben
6 jane joe
7 jane jack
8 jane jack
9 ben jane
10 joe jane
11 jane jill
12 beatrice joe
13 ben joe
14 jill beatrice
15 joe beatrice
16 beatrice beatrice
17 beatrice jane
18 jill joe
19 joe joe
>>> df2
ID
0 jack
1 jill
2 jane
3 joe
4 ben
5 beatrice
What I'd like to do is add a column to df2 with the count of rows in df1 where the given name can be found in either column ID_a or ID_b, resulting in this:
>>> df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
This loop gets what I need, but it is inefficient for large dataframes. If someone could suggest an alternative, nicer solution, I'd be very grateful:
df2['count'] = 0
for idx, row in df2.iterrows():
    df2.loc[idx, 'count'] = len(df1[(df1.ID_a == row.ID) | (df1.ID_b == row.ID)])
Thanks in advance!
Recommended answer
The "either" part complicates things, but it should still be doable.
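To make the complication concrete, here is a minimal sketch on a tiny made-up frame (the three rows below are hypothetical, not the question's data). It shows that simply counting every occurrence across both columns overcounts whenever a row holds the same name in ID_a and ID_b, which is why each option below deduplicates per row in some way.

```python
import pandas as pd

# Hypothetical toy data: row 1 has 'ben' in both columns.
toy = pd.DataFrame({'ID_a': ['joe', 'ben', 'jill'],
                    'ID_b': ['ben', 'ben', 'joe']})

# Naive approach: count every occurrence across both columns.
naive = pd.concat([toy.ID_a, toy.ID_b]).value_counts()

# What the question asks for: rows where the name is in at least one column.
rows = {name: int(((toy.ID_a == name) | (toy.ID_b == name)).sum())
        for name in ['joe', 'ben', 'jill']}

print(naive['ben'])  # 3 occurrences
print(rows['ben'])   # but only 2 distinct rows
```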
Option 1
Since other users decided to turn this into a speed race, here's mine:
from collections import Counter
from itertools import chain
c = Counter(chain.from_iterable(set(x) for x in df1.values.tolist()))
# c is already a Counter, so names that never occur in df1 map to 0
df2['count'] = df2['ID'].map(c)
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
176 µs ± 7.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
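As a rough illustration of why Option 1 wraps each row in set(...), the sketch below runs the same Counter and chain pattern on three hypothetical rows written as plain lists. Converting a row to a set collapses a repeated name, so each row contributes at most one count per name.

```python
from collections import Counter
from itertools import chain

# Hypothetical rows; the last two repeat a name within the row.
rows = [['joe', 'ben'], ['beatrice', 'beatrice'], ['joe', 'joe']]

# set(r) removes the within-row repeat before counting.
c = Counter(chain.from_iterable(set(r) for r in rows))

print(c['joe'])       # 2
print(c['beatrice'])  # 1
print(c['jack'])      # 0, a Counter returns 0 for unseen keys
```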
Option 2
(Original answer) stack-based
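# Series.count(level=...) was removed in pandas 2.0, so the level-wise count
# is written as groupby(level=1).count() here.
c = df1.stack().groupby(level=0).value_counts().groupby(level=1).count()
Or,
c = df1.stack().reset_index(level=0).drop_duplicates()[0].value_counts()
Or,
v = df1.stack()
c = v.groupby([v.index.get_level_values(0), v]).count().groupby(level=1).count()
# c = v.groupby([v.index.get_level_values(0), v]).nunique().groupby(level=1).count()
Then,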
df2['count'] = df2.ID.map(c)
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
Option 3
repeat-based reshape and counting
v = pd.DataFrame({
    'i': df1.values.reshape(-1),
    'j': df1.index.repeat(2)
})
c = v.loc[~v.duplicated(), 'i'].value_counts()
df2['count'] = df2.ID.map(c)
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
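To show what the reshape and repeat in Option 3 produce, here is a small sketch on a hypothetical two-row frame. Flattening the values row by row and repeating each index label twice pairs every name with the row it came from, so a duplicated (name, row) pair marks a name that appears in both columns of the same row.

```python
import pandas as pd

# Hypothetical toy data: row 1 has 'ben' in both columns.
toy = pd.DataFrame({'ID_a': ['joe', 'ben'], 'ID_b': ['ben', 'ben']})

v = pd.DataFrame({
    'i': toy.values.reshape(-1),  # names, flattened row by row
    'j': toy.index.repeat(2)      # the row label for each flattened name
})
print(v)

# Drop repeated (name, row) pairs, then count names.
c = v.loc[~v.duplicated(), 'i'].value_counts()
print(c['ben'])  # 2
```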
Option 4
concat + mask
v = pd.concat(
    [df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)], axis=0
).value_counts()
df2['count'] = df2.ID.map(v)
df2
ID count
0 jack 3
1 jill 5
2 jane 8
3 joe 9
4 ben 7
5 beatrice 6
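If you want to convince yourself that a vectorized option matches the original loop, a check along these lines may help. This is only a sketch: it rebuilds the question's data, uses Option 4 as the candidate, and the names expected, got and agree are my own. The fillna(0) is an added assumption to cover names that never occur in df1, since mapping through a value_counts Series yields NaN for them.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
names = ['jack', 'jill', 'jane', 'joe', 'ben', 'beatrice']
df1 = pd.DataFrame({'ID_a': np.random.choice(names, 20),
                    'ID_b': np.random.choice(names, 20)})

# Reference: explicit row count per name, same logic as the question's loop.
expected = pd.Series(
    {n: int(((df1.ID_a == n) | (df1.ID_b == n)).sum()) for n in names}
)

# Option 4: blank out ID_b where it repeats ID_a, then count occurrences.
v = pd.concat([df1.ID_a, df1.ID_b.mask(df1.ID_a == df1.ID_b)]).value_counts()
got = pd.Series(names, index=names).map(v).fillna(0).astype(int)

agree = bool((got == expected).all())
print(agree)
```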