In the code snippet below, I would expect the logs to print the numbers 0 - 4. I understand that the numbers may not appear in that order, as the task would be broken up into a number of parallel operations.

Code snippet:

    from dask import dataframe as dd
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': np.arange(5), 'B': np.arange(5), 'C': np.arange(5)})
    ddf = dd.from_pandas(df, npartitions=1)

    def aggregate(x):
        print('B val received: ' + str(x.B))
        return x

    ddf.apply(aggregate, axis=1).compute()

But when the above code is run, I see this instead:

    B val received: 1
    B val received: 1
    B val received: 1
    B val received: 0
    B val received: 0
    B val received: 1
    B val received: 2
    B val received: 3
    B val received: 4

Instead of 0 - 4, I see a series of 1s printed first, and an extra 0. I have noticed the "extra" rows of value 1 occurring every time I have set up a Dask DataFrame and run an apply operation on it.

Printing the dataframe shows no additional rows with value 1 throughout:

       A  B  C
    0  0  0  0
    1  1  1  1
    2  2  2  2
    3  3  3  3
    4  4  4  4

My question is: Where are these rows with value 1 coming from? Why do they appear to consistently occur prior to the "actual" rows in the dataframe?
The 1 values seem unrelated to the values in the actual rows (that is, it does not look as though the second row is being grabbed a few extra times for some reason).

Solution

Before Dask runs an operation across the entire collection of partitions, it first checks what you have told it to do. That is where the first few print statements come from. In practice, the function is first called on a small set of placeholder rows, not on your actual data, which is why the printed 1 values do not correspond to real rows. It is part of the built-in error checking that prevents Dask from going through a long series of operations only to fail at the end.