This article covers how to select, for each row of a pandas DataFrame, the names of all columns whose value is greater than the value in another column.

Problem description

I'm trying to find the column names of each column in a pandas dataframe where the value is greater than that of another column.

For example, if I have the following dataframe:

   A  B  C  D  threshold
0  1  3  3  1  2
1  2  3  6  1  5
2  9  5  0  2  4

For each row I would like to return the names of the columns where the values are greater than the threshold, so I would have:

0: B, C
1: C
2: A, B

Any help would be much appreciated!

Recommended answer

If you want a large increase in speed you can use NumPy's vectorized where function.

import numpy as np
import pandas as pd
s = np.where(df.gt(df['threshold'], axis=0), ['A, ', 'B, ', 'C, ', 'D, ', ''], '')
pd.Series([''.join(x).strip(', ') for x in s])

0    B, C
1       C
2    A, B
dtype: object
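
The snippet above hard-codes the column labels. As a minimal sketch, assuming the same example frame from the question, the labels can instead be taken from the frame itself. The helper name `columns_above_threshold` and its parameters are hypothetical and only for illustration; the approach is still `np.where` broadcasting labels over a boolean mask.

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    'A': [1, 2, 9],
    'B': [3, 3, 5],
    'C': [3, 6, 0],
    'D': [1, 1, 2],
    'threshold': [2, 5, 4],
})

def columns_above_threshold(frame, threshold_col='threshold', sep=', '):
    """Hypothetical helper: join the names of columns exceeding threshold_col."""
    values = frame.drop(columns=threshold_col)
    # True where a cell is strictly greater than the threshold of its row.
    mask = values.gt(frame[threshold_col], axis=0).to_numpy()
    # Column labels as an object array so np.where keeps them as Python strings.
    names = np.asarray(values.columns, dtype=object)
    # Broadcast the labels across rows; cells that fail the test become ''.
    labels = np.where(mask, names, '')
    # Join the non-empty labels of each row.
    joined = [sep.join(name for name in row if name) for row in labels]
    return pd.Series(joined, index=frame.index)

result = columns_above_threshold(df)
print(result)
```

Joining only the non-empty labels avoids the trailing-separator cleanup done with `strip(', ')`, at the cost of assuming that no column is named with an empty string.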

On a DataFrame of 100,000 rows, this is more than an order of magnitude faster than the solutions from @jezrael and MaxU. Here I create the test DataFrame first.

n = 100000
df = pd.DataFrame(np.random.randint(0, 10, (n, 5)),
                  columns=['A', 'B', 'C', 'D', 'threshold'])

Timings

%%timeit
s = np.where(df.gt(df['threshold'], axis=0), ['A, ', 'B, ', 'C, ', 'D, ', ''], '')
pd.Series([''.join(x).strip(', ') for x in s])
280 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df1 = df.drop('threshold', axis=1).gt(df['threshold'], axis=0)
df1 = df1.apply(lambda x: ', '.join(x.index[x]), axis=1)
3.15 s ± 82.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
x = df.drop('threshold', axis=1)
x.T.gt(df['threshold']).agg(lambda c: ', '.join(x.columns[c]))
3.28 s ± 145 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
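
Timings only mean something if the three approaches return the same result. The sketch below wraps each one in a function and compares them on a small random frame. The function names, the seed, and the frame size of 1,000 rows are arbitrary assumptions for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
small = pd.DataFrame(rng.integers(0, 10, (1000, 5)),
                     columns=['A', 'B', 'C', 'D', 'threshold'])

def with_where(df):
    # NumPy approach: pick a label or '' per cell, then join each row.
    s = np.where(df.gt(df['threshold'], axis=0), ['A, ', 'B, ', 'C, ', 'D, ', ''], '')
    return pd.Series([''.join(x).strip(', ') for x in s], index=df.index)

def with_apply(df):
    # Row-wise apply over the boolean mask.
    mask = df.drop(columns='threshold').gt(df['threshold'], axis=0)
    return mask.apply(lambda x: ', '.join(x.index[x]), axis=1)

def with_transpose(df):
    # Transpose so each original row becomes a column, then aggregate per column.
    x = df.drop(columns='threshold')
    return x.T.gt(df['threshold']).agg(lambda c: ', '.join(x.columns[c]))

a = with_where(small).tolist()
b = with_apply(small).tolist()
c = with_transpose(small).tolist()
print(a == b == c)
```

Comparing plain lists sidesteps dtype differences between the resulting Series, which can vary across pandas versions.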

08-18 19:37