I have a column of data whose values are 0, 1 and 2. I want to replace every instance of 1 and 2 with 0 if the value does not persist for n rows. If a value does persist for n rows, I want to leave it as-is. Example with n = 4:
df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2],
'desired': [0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,0,0,0,0,0,0]})
data desired
0 1 0
1 0 0
2 1 0
3 2 0
4 0 0
5 0 0
6 0 0
7 1 0
8 0 0
9 2 2
10 2 2
11 2 2
12 2 2
13 2 2
14 2 2
15 0 0
16 1 0
17 0 0
18 1 0
19 0 0
20 2 0
The intended behaviour is essentially to "filter" the underlying data so that short-lived changes in value are removed. I want to be able to declare the number of consecutive values required for the underlying data to be kept, including everything at and beyond 'n'. (If n = 4 and there are 6 consecutive values, I want all 6, not just the 2 beyond the cut-off of 4.) Is there a vectorized way to do this in pandas?
Best answer
Use diff with cumsum to label each run of consecutive equal values, then get the size of each group:
n = 4
groups = df['data'].diff().ne(0).cumsum()
df['desired'] = df['data'].where(df.groupby(groups)['data'].transform('size').gt(n), other=0)
data desired
0 1 0
1 0 0
2 1 0
3 2 0
4 0 0
5 0 0
6 0 0
7 1 0
8 0 0
9 2 2
10 2 2
11 2 2
12 2 2
13 2 2
14 2 2
15 0 0
16 1 0
17 0 0
18 1 0
19 0 0
20 2 0
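To make the grouping step concrete, here is a small sketch of the same approach on the sample frame, with the intermediate run labels made visible: `diff().ne(0)` marks the start of each run, `cumsum()` turns those marks into a label per run, and a groupby transform broadcasts each run's length back to its rows.

```python
import pandas as pd

df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2]})
n = 4

# True wherever a value differs from its predecessor (run start);
# the cumulative sum of those booleans labels each run
groups = df['data'].diff().ne(0).cumsum()

# broadcast each run's length back onto its member rows
sizes = df.groupby(groups)['data'].transform('size')

# keep a value only when its run is longer than n rows
result = df['data'].where(sizes.gt(n), other=0)
print(groups.tolist())
print(result.tolist())
```

Only the run of six 2's (length 6 > 4) survives; every shorter run is zeroed out.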
Timings:
# create sample dataframe of 1 million rows
df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2]})
dfbig = pd.concat([df]*50000, ignore_index=True)
dfbig.shape
(1050000, 1)
Erfan
%%timeit
n = 4
groups = dfbig['data'].diff().ne(0).cumsum()
dfbig['data'].where(dfbig.groupby(groups)['data'].transform('size').gt(n), other=0)
268 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Quang Hoang
%%timeit
n=4
s = dfbig['data'].diff().eq(0).rolling(n-1).sum()
# fill
np.where(s.where(s>=n-1).bfill(limit=n-1).notnull(), dfbig['data'], 0)
164 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Pirate
%%timeit
a = dfbig.data.to_numpy()
b = np.append(0, a[:-1] != a[1:]).cumsum()
dfbig.data.where(np.bincount(b)[b] >= 4, 0)
62 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
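The NumPy version can be sketched on the sample frame as follows: the boolean array of value changes, cumulatively summed, labels each run starting from 0; `np.bincount` then counts the rows per label, and indexing the counts with the labels broadcasts each run's length back to its rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2]})
n = 4

a = df['data'].to_numpy()
# a boundary occurs where a value differs from its predecessor;
# prepending 0 and cumulatively summing labels each run 0, 1, 2, ...
b = np.append(0, a[:-1] != a[1:]).cumsum()

# np.bincount(b) is the length of each run; b indexes it back per row
sizes = np.bincount(b)[b]
result = df['data'].where(sizes >= n, 0)
print(result.tolist())
```

Avoiding the groupby machinery entirely is what makes this the fastest of the three in the timings above.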
Conclusion: Pirate is fastest, then Quang Hoang, then Erfan.