Problem description
I want to apply a lambda function to a dask dataframe to change the labels in a column when a label's share of the column is below a certain percentage. The method I am using works well for a pandas dataframe, but the same code does not work for a dask dataframe. The code is below.
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'A':['ant','ant','cherry', 'bee', 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog','roo', 'emu', 'emu']})
ddf = dd.from_pandas(df, npartitions=2)
df:
Output:
A B C
0 ant cat dog
1 ant peach dog
2 cherry cat roo
3 bee cat emu
4 ant peach emu
ddf.compute()
Output:
A B C
0 ant cat dog
1 ant peach dog
2 cherry cat roo
3 bee cat emu
4 ant peach emu
list_ = ['B','C']
df.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x)
Output:
A B C
0 ant cat dog
1 ant peach dog
2 other cat roo
3 other cat emu
4 ant peach emu
Doing the same for the dask dataframe:
ddf.apply(lambda x: x.mask(x.map(x.value_counts(normalize=True))<.5, 'other') if x.name not in list_ else x,axis=1).compute()
Output (gives a warning and not the required output):
/home/michael/env/lib/python3.5/site-packages/dask/dataframe/core.py:3107: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
warnings.warn(msg)
A B C
0 other other other
1 other other other
2 other other other
3 other other other
4 other other other
Could someone help me get the required output for the dask dataframe?
Thanks,
Michael
Recommended answer
You are not performing the same operation in the pandas and dask cases: in the dask call you pass axis=1, so the lambda sees one row at a time and you end up replacing any value that occurs less than twice in a given row, which is all of them.
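The effect of axis=1 can be reproduced without dask. The following is a small pandas-only sketch using the dataframe from the question; the helper name relabel is introduced here for illustration. With axis=1 the function receives a row, x.name is the row label (never 'B' or 'C'), and each of the three values in a row has a relative frequency of 1/3:

```python
import pandas as pd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})
list_ = ['B', 'C']

def relabel(x):
    # x is a column when axis=0 and a row when axis=1
    if x.name in list_:
        return x
    freq = x.map(x.value_counts(normalize=True))
    return x.mask(freq < .5, 'other')

by_column = df.apply(relabel)        # axis=0: frequencies within each column
by_row = df.apply(relabel, axis=1)   # axis=1: frequencies within each row

print(by_column)
print(by_row)
```

The first result matches the pandas output in the question; the second is filled entirely with 'other', like the dask output.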
If you change to axis=0, you will see that you get an exception. This is because to compute, say, the first partition, the lambda function would also need access to the whole dataframe; otherwise, how could it get the value_counts?
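The point about partitions can be illustrated with plain pandas. The sketch below stands in for a partition by slicing column A of the question's data with iloc; the split position is an assumption made for illustration, since dask chooses its own divisions. Frequencies computed inside the slice differ from the global ones, so a per-partition value_counts would relabel the wrong values:

```python
import pandas as pd

a = pd.Series(['ant', 'ant', 'cherry', 'bee', 'ant'], name='A')

# Hypothetical partition boundary, chosen only for this illustration
part2 = a.iloc[3:]          # contains 'bee' and 'ant'

global_freq = a.value_counts(normalize=True)
part2_freq = part2.value_counts(normalize=True)

# 'bee' is rare over the whole column but not within the slice
print(global_freq['bee'], part2_freq['bee'])
# Using only the slice, nothing falls below the .5 threshold
print((part2.map(part2_freq) < .5).tolist())
```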
The solution to your problem is to get the value counts separately. You could compute them explicitly (the result is small) or pass them to the lambda. Note also that going this route lets you avoid apply in favour of map, which makes things more explicit. Here I pick just the one column; you could loop over the others.
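vc = ddf.A.value_counts().compute()  # small result, cheap to compute eagerly
vc /= vc.sum()  # because dask's value_counts doesn't normalise
def simple_map(df):
    # keep labels whose frequency is >= 0.5, matching the pandas mask (< .5 becomes 'other')
    df = df.copy()  # avoid mutating the partition in place
    df['A'] = df['A'].map(lambda x: x if vc[x] >= 0.5 else 'other')
    return df
ddf.map_partitions(simple_map, meta=df[:0]).compute()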
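The loop over columns mentioned in the answer could look like the sketch below. The helper name relabel_rare, its threshold and other parameters, and the cols list are illustrative and not part of the original answer. The function is written against a pandas dataframe, which is what each partition is under dask's pandas backend, so it can be tried without dask; here two iloc slices stand in for partitions:

```python
import pandas as pd

def relabel_rare(part, freqs, threshold=0.5, other='other'):
    """Replace labels whose global relative frequency is below threshold."""
    part = part.copy()  # do not mutate the partition passed in
    for col, freq in freqs.items():
        keep = part[col].map(freq) >= threshold
        part[col] = part[col].where(keep, other)
    return part

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})

cols = ['A']  # columns to relabel; 'B' and 'C' are left untouched, as in the question
# Global frequencies, computed once over each whole column
freqs = {c: df[c].value_counts(normalize=True) for c in cols}

# Apply to two slices standing in for partitions, then stitch them together
result = pd.concat([relabel_rare(df.iloc[:3], freqs),
                    relabel_rare(df.iloc[3:], freqs)])
print(result)
```

With dask, freqs would instead be built from ddf[c].value_counts().compute() divided by its sum, and a call along the lines of ddf.map_partitions(relabel_rare, freqs, meta=df[:0]) should apply the same function to every partition, since map_partitions forwards extra arguments to the function.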