Applying a lambda function to a dask dataframe

Problem description

I am looking to apply a lambda function to a dask dataframe to change the labels in a column if they occur less than a certain percentage of the time. The method that I am using works well for a pandas dataframe, but the same code does not work for a dask dataframe. The code is below.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A':['ant','ant','cherry', 'bee', 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog','roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)

df:

Output:

     A     B      C
0   ant    cat   dog
1   ant    peach dog
2   cherry cat   roo
3   bee    cat   emu
4   ant    peach emu



ddf.compute()

Output:

     A     B      C
0   ant    cat   dog
1   ant    peach dog
2   cherry cat   roo
3   bee    cat   emu
4   ant    peach emu



list_ = ['B','C']
df.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x)

Output:

     A     B      C
0   ant    cat   dog
1   ant    peach dog
2   other  cat   roo
3   other  cat   emu
4   ant    peach emu

Doing the same for the dask dataframe:

ddf.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x,axis=1).compute()

Output (gives a warning, not the required output):

/home/michael/env/lib/python3.5/site-packages/dask/dataframe/core.py:3107: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)
      A       B       C
0   other   other   other
1   other   other   other
2   other   other   other
3   other   other   other
4   other   other   other

Could someone help me get the required output for the dask dataframe instance?

Thanks

Michael

Recommended answer

You are not performing the same operation in the pandas and dask cases: for the latter you have axis=1, so you end up replacing any value which occurs less than twice in a given row, which is all of them.
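As a quick check (this is just the question's own pandas one-liner run row-wise, no new logic): with axis=1, x.name is the row index (0..4), which is never in list_, and every value occurs exactly once in its three-element row (1/3 < 0.5), so everything is masked to 'other'.

df.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True)) < .5, 'other') if x.name not in list_ else x, axis=1)
# every cell becomes 'other'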

If you change to axis=0, you will see that you get an exception. This is because, to compute (say) the first partition, the whole dataframe would also need to be passed to the lambda function; otherwise, how could you get the value_counts?
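One way to see that partition boundary (a small sketch, not part of the original answer) is to compute the value counts inside each partition with map_partitions; each partition only sees its own slice of rows, so its counts disagree with the global ones:

# each partition is an independent pandas DataFrame, so a per-partition
# value_counts cannot see the rows held by the other partition
ddf.map_partitions(lambda part: part['A'].value_counts()).compute()
# e.g. with the 3-row/2-row split here, one partition counts 'ant' twice
# and 'cherry' once, while the other counts 'bee' and 'ant' once each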

The solution to your problem is to get the value counts separately. You could explicitly compute this (the result is small) or pass it to the lambda. Note, furthermore, that going this route means you can avoid using apply in favour of map, making things more explicit. Here only the one column is handled; you could loop over the others (a sketch follows the code below).

vc = ddf.A.value_counts().compute()
vc /= vc.sum()  # because dask's value_counts doesn't normalise

def simple_map(df):
    # runs once per partition, where df is a plain pandas DataFrame
    df['A'] = df['A'].map(lambda x: x if vc[x] > 0.5 else 'other')
    return df

# meta=df[:0] hands dask an empty frame with the correct columns/dtypes
ddf.map_partitions(simple_map, meta=df[:0]).compute()
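The snippet above only handles column A; the sketch below is one way to follow the "you could loop" suggestion, applying the question's < 0.5 rule to every column not excluded via list_ (the helper name replace_rare and the loop are illustrative additions, not from the original answer):

list_ = ['B', 'C']

# pre-compute normalised value counts once per relevant column
# (these results are small enough to hold in memory)
vcs = {}
for col in ddf.columns:
    if col in list_:
        continue
    vc = ddf[col].value_counts().compute()
    vcs[col] = vc / vc.sum()

def replace_rare(part):
    # part is the pandas DataFrame backing one partition
    for col, vc in vcs.items():
        part[col] = part[col].map(lambda v: 'other' if vc[v] < 0.5 else v)
    return part

ddf.map_partitions(replace_rare, meta=df[:0]).compute()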

