Quickly getting the indices of the top-k elements of each column in a pandas DataFrame

Question

I have a very large pandas DataFrame with approximately 500,000 columns. Each column is about 500 elements long. For each column, I need to retrieve the (index, column) location of the top-k elements in that column.

So, if k were equal to 2 and this were my data frame:

    A   B   C  D
w   4   8  10  2
x   5   1   1  6
y   9  22  25  7
z  15   5   7  2

I would like to get back:

[(A,y),(A,z),(B,w),(B,y),(C,w),(C,y),(D,x),(D,y)]

Keep in mind that I have approximately 500,000 columns, so speed is my primary concern. Is there a reasonable way of doing this that will not take an entire week on my machine? What is the fastest way, and will even that be fast enough for the amount of data I have?

Thanks for your help!

Recommended answer

I think numpy has a good solution for this that is fast, and you can format the output however you want.

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: df = pd.DataFrame(data=np.random.randint(0, 1000, (200, 500000)),
   ...:                   columns=range(500000), index=range(200))

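In [3]: def top_k(x, k):
   ...:     # argpartition places the positions of the k largest values in the last k slots
   ...:     ind = np.argpartition(x, -k)[-k:]
   ...:     # sort only those k positions, ascending by value
   ...:     return ind[np.argsort(x[ind])]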

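In [69]: # to_numpy() replaces as_matrix(), which was removed in pandas 1.0;
    ...: # .T gives one row per DataFrame column, matching the output shown below
    ...: %time np.apply_along_axis(lambda x: top_k(x, 2), 0, df.to_numpy()).T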
CPU times: user 5.91 s, sys: 40.7 ms, total: 5.95 s
Wall time: 6 s

Out[69]:
array([[ 14,  54],
       [178, 141],
       [ 49, 111],
       ...,
       [ 24, 122],
       [ 55,  89],
       [  9, 175]])
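
Since the question asks for (column, index) pairs rather than raw positions, here is a minimal sketch of one way to format the result, using the small frame from the question. It reuses the `top_k` function above; the names `pos`, `labels`, and `pairs` are illustrative only, and `to_numpy()` assumes pandas 0.24 or later.

```python
import numpy as np
import pandas as pd


def top_k(x, k):
    # positions of the k largest values, ascending by value
    ind = np.argpartition(x, -k)[-k:]
    return ind[np.argsort(x[ind])]


# the 4x4 example frame from the question
df = pd.DataFrame(
    {"A": [4, 5, 9, 15], "B": [8, 1, 22, 5], "C": [10, 1, 25, 7], "D": [2, 6, 7, 2]},
    index=["w", "x", "y", "z"],
)
k = 2

# apply_along_axis returns shape (k, n_cols); transpose to get one row per column
pos = np.apply_along_axis(lambda x: top_k(x, k), 0, df.to_numpy()).T

# translate row positions into index labels, then pair each with its column name
labels = df.index.to_numpy()[pos]
pairs = [(col, lab) for col, row in zip(df.columns, labels) for lab in row]
print(pairs)
```

For this frame there are no ties among the top two values of any column, so the pairs come out in the order the question lists them.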

The numpy version is pretty fast compared to the pandas solution (which is cleaner IMO, but we're going for speed here):

In [41]: %time np.array([df[c].nlargest(2).index.values for c in df])
CPU times: user 3min 43s, sys: 6.58 s, total: 3min 49s
Wall time: 4min 8s

Out[41]:
array([[ 54,  14],
       [141, 178],
       [111,  49],
       ...,
       [122,  24],
       [ 89,  55],
       [175,   9]])

The two results list each column's indices in opposite order: the numpy version is ascending by value, while nlargest is descending. You can easily fix this by reversing the sort in the numpy version.
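
A minimal sketch of that reversal, assuming a descending variant named `top_k_desc` (a hypothetical name, not part of the original answer): reversing the `argsort` result with `[::-1]` puts the largest value first, which is the order `nlargest` uses.

```python
import numpy as np
import pandas as pd


def top_k_desc(x, k):
    # positions of the k largest values, in no particular order
    ind = np.argpartition(x, -k)[-k:]
    # argsort is ascending, so reverse it to put the largest value first
    return ind[np.argsort(x[ind])[::-1]]


col = pd.Series([4, 5, 9, 15, 1])  # small illustrative column with no ties

np_order = top_k_desc(col.to_numpy(), 2)
pd_order = col.nlargest(2).index.to_numpy()
print(np_order, pd_order)
```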

Note that because the example uses randomly generated integers, a column can easily contain more than k values tied for the maximum, so the indices returned may differ between methods. Every method still gives a valid result: k indices whose values are the k largest in the column.
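
The tie behaviour can be checked on a tiny made-up column. This sketch does not assume which tied positions either method picks; it only compares the values found at the returned positions.

```python
import numpy as np
import pandas as pd

x = np.array([7, 3, 7, 7, 1])  # three values tied for the maximum
k = 2

np_pos = np.argpartition(x, -k)[-k:]                # numpy: any k of the tied positions
pd_pos = pd.Series(x).nlargest(k).index.to_numpy()  # pandas: its own tie-breaking rule

# the positions may differ, but the values at those positions must agree
np_vals = np.sort(x[np_pos])
pd_vals = np.sort(x[pd_pos])
print(np_pos, pd_pos, np_vals, pd_vals)
```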

