Equivalent of pandas sort_values for a dask DataFrame

Question

What is the equivalent of pandas' sort_values for a dask DataFrame? I am trying to scale some pandas code that has memory issues by using a dask DataFrame instead.

Would it be:

ddf.set_index([col1, col2], sorted=True)

?

Recommended answer

Sorting in parallel is hard. You have two options in dask.dataframe:

As you are doing now, you can call set_index with a single column as the index:

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: df = pd.DataFrame({'x': [3, 2, 1], 'y': ['a', 'b', 'c']})

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: ddf.set_index('x').compute()
Out[5]:
   y
x
1  c
2  b
3  a

Unfortunately, dask.dataframe does not (as of November 2016) support multi-column indexes:

In [6]: ddf.set_index(['x', 'y']).compute()
NotImplementedError: Dask dataframe does not yet support multi-indexes.
You tried to index with this index: ['x', 'y']
Indexes must be single columns only.

nlargest

Given how you phrased your question, I suspect this does not apply to you, but cases that use sorting can often get by with a much cheaper solution: nlargest.

In [7]: ddf.x.nlargest(2).compute()
Out[7]:
0    3
1    2
Name: x, dtype: int64

In [8]: ddf.nlargest(2, 'x').compute()
Out[8]:
   x  y
0  3  a
1  2  b
