python - A more efficient way to create a DataFrame of the top n values

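I have a DataFrame of categories that I need to clean up by limiting the values to the top n categories. Any value that is not among the top n categories should be binned under 0 (or "Other").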

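I tried the code below, which loops over every row of a column, and then over every column of the DataFrame, checking whether the value at that position is among the top n value_counts of that column. If it is, the value is kept; if not, it is replaced with 0.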

Technically this implementation works, but it takes far too long to run when the number of rows is large. What is a faster way to do this in pandas / numpy?

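import numpy as np
import pandas as pd

z = pd.DataFrame(np.random.randint(1, 4, size=(100000, 4)))
x = pd.DataFrame()
n = 10
for j in z:  # each column label
    for i in z[j].index:  # each row label
        # keep the value only if it is among the column's n most frequent values
        if z.at[i, j] in z[j].value_counts().head(n).index.tolist():
            x.at[i, j] = z.at[i, j]
        else:
            x.at[i, j] = 0
print(x)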

Best answer

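I think you can use apply with a custom function to loop over the columns, value_counts to get the top values, and where with isin as a boolean mask to do the replacement: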

def f(x):
    y = x.value_counts().head(n).index
    return x.where(x.isin(y), 0)

print (z.apply(f))

This is equivalent to:

print (z.apply(lambda x: x.where(x.isin(x.value_counts().head(n).index), 0)))
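
As a self-contained sketch of the same idea, the helper below takes n as an argument instead of reading it from a global variable, and lets the caller choose the replacement value. The names keep_top_n and other, and the small example frame, are hypothetical and only for illustration:

```python
import pandas as pd


def keep_top_n(col, n, other=0):
    """Keep the n most frequent values of a column; replace the rest with `other`."""
    # value_counts() sorts by frequency, so head(n) gives the top n values
    top = col.value_counts().head(n).index
    # where() keeps entries where the mask is True and fills the rest with `other`
    return col.where(col.isin(top), other)


# Hypothetical example data: in each column one value occurs only once
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2, 3],
    "b": [5, 5, 6, 6, 6, 7],
})

# Extra keyword arguments to apply() are forwarded to the function
result = df.apply(keep_top_n, n=2)
print(result)
```

Passing other="Other" should give the "Other" label mentioned in the question, though the affected columns will then most likely end up with object dtype.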

Sample:

#N =100000
N = 10
np.random.seed(123)
z = pd.DataFrame(np.random.randint(1,4,size=(N, 4)))
print (z)
   0  1  2  3
0  3  2  3  3
1  1  3  3  2
2  3  2  3  2
3  1  2  3  2
4  1  3  1  2
5  3  2  1  1
6  1  1  2  3
7  1  3  1  1
8  2  1  2  1
9  1  1  3  2

x=pd.DataFrame()
n=2
for j in z:
    for i in z[j].index:
        if z.at[i,j] in z[j].value_counts().head(n).index.tolist():
            x.at[i,j] = z.at[i,j]
        else:
            x.at[i,j]= 0
print(x)
     0    1    2    3
0  3.0  2.0  3.0  0.0
1  1.0  3.0  3.0  2.0
2  3.0  2.0  3.0  2.0
3  1.0  2.0  3.0  2.0
4  1.0  3.0  1.0  2.0
5  3.0  2.0  1.0  1.0
6  1.0  0.0  0.0  0.0
7  1.0  3.0  1.0  1.0
8  0.0  0.0  0.0  1.0
9  1.0  0.0  3.0  2.0

print (z.apply(lambda x: x.where(x.isin(x.value_counts().head(n).index), 0)))
   0  1  2  3
0  3  2  3  0
1  1  3  3  2
2  3  2  3  2
3  1  2  3  2
4  1  3  1  2
5  3  2  1  1
6  1  0  0  0
7  1  3  1  1
8  0  0  0  1
9  1  0  3  2

A similar solution with numpy.where:

print (z.apply(lambda x: np.where(x.isin(x.value_counts().head(n).index), x, 0)))
   0  1  2  3
0  3  2  3  0
1  1  3  3  2
2  3  2  3  2
3  1  2  3  2
4  1  3  1  2
5  3  2  1  1
6  1  0  0  0
7  1  3  1  1
8  0  0  0  1
9  1  0  3  2
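
To check on a larger frame that the where and numpy.where variants agree, something like the following sketch can be used. The size of 1000 rows and the variable names with_where and with_np are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
z = pd.DataFrame(np.random.randint(1, 4, size=(1000, 4)))
n = 2

# pandas variant: Series.where keeps values where the mask is True
with_where = z.apply(lambda x: x.where(x.isin(x.value_counts().head(n).index), 0))

# numpy variant: np.where picks x where the mask is True, otherwise 0
with_np = z.apply(lambda x: np.where(x.isin(x.value_counts().head(n).index), x, 0))

# Compare the raw values so that a possible dtype difference does not matter
same = bool((with_where.to_numpy() == with_np.to_numpy()).all())
print(same)

# Each column now holds at most n kept values plus the replacement 0
print(with_where.nunique())
```

Both variants compute value_counts once per column rather than once per cell, which is where the loop in the question spends most of its time.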
