python - How to use pandas.cut() (or an equivalent) efficiently in dask?

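Is there an equivalent to pandas.cut() in dask?
I am trying to bin and group a large dataset in Python. It is a list of measured electrons with the properties (position x, position y, energy, time). I need to group it by position x and position y, and then bin it into energy classes.
So far I can do this with pandas, but I would like to run it in parallel, so I tried using dask.
The groupby method works very well, but unfortunately I ran into difficulties when trying to bin the data by energy. I found a solution using pandas.cut(), but it requires calling compute() on the raw dataset (essentially turning it into non-parallel code). Is there an equivalent to pandas.cut() in dask, or is there another (elegant) way to achieve the same thing?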

import dask.dataframe
import pandas
# create dask dataframe from the array
dd = dask.dataframe.from_array(mainArray, chunksize=100000, columns=('posX','posY', 'time', 'energy'))

# Set the bins to bin along energy
bins = range(0, 10000, 500)

# Create the cut in energy (using non-parallel pandas code...)
energyBinner = pandas.cut(dd['energy'], bins)

# Group the data according to posX, posY and energy
grouped = dd.compute().groupby([energyBinner, 'posX', 'posY'])

# Apply the count() method to the data:
numberOfEvents = grouped['time'].count()

Thanks!

Best answer

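You should be able to do dd['energy'].map_partitions(pd.cut, bins) (with pandas imported as pd). This applies pandas.cut to each partition separately, so you do not need to call compute() on the whole dataset first.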