Create a dask Bag from generators

Problem description

I would like to create a dask.Bag (or dask.Array) from a list of generators. The gotcha is that the generators (when evaluated) are too large for memory.

from dask import delayed
import dask.bag as db

delayed_array = [delayed(generator) for generator in list_of_generators]
my_bag = db.from_delayed(delayed_array)

NB list_of_generators is exactly that - the generators haven't been consumed (yet).

My problem is that when creating delayed_array the generators are consumed and RAM is exhausted. Is there a way to get these long lists into the Bag without first consuming them, or at least consuming them in chunks so RAM use is kept low?

NNB I could write the generators to disk, and then load the files into the Bag - but I thought I might be able to use dask to get around this?
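
As a hedged sketch of that disk-based fallback (the file names and the JSON-lines format are assumptions for illustration), each generator could be streamed to its own file and the files then loaded lazily:

import json
import dask.bag as db

# Stream each generator to its own file so nothing is held in RAM all at once
# (paths and JSON-lines format are illustrative assumptions).
for i, gen in enumerate(list_of_generators):
    with open(f"part-{i}.jsonl", "w") as f:
        for item in gen:
            f.write(json.dumps(item) + "\n")

# dask.bag reads the files lazily, one block per partition.
bag = db.read_text("part-*.jsonl").map(json.loads)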

Recommended answer

A decent subset of Dask.bag can work with large iterators. Your solution is almost perfect, but you'll need to provide a function that creates your generators when called rather than the generators themselves.

In [1]: import dask.bag as db

In [2]: import dask

In [3]: b = db.from_delayed([dask.delayed(range)(i) for i in [100000000] * 5])

In [4]: b
Out[4]: dask.bag<bag-fro..., npartitions=5>

In [5]: b.take(5)
Out[5]: (0, 1, 2, 3, 4)

In [6]: b.sum()
Out[6]: <dask.bag.core.Item at 0x7f852d8737b8>

In [7]: b.sum().compute()
Out[7]: 24999999750000000
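
Applying the same idea to the setup in the question, a hedged sketch (make_generator and sizes are hypothetical stand-ins for however the original generators were built) passes generator-creating functions to dask.delayed instead of the generators themselves:

from dask import delayed
import dask.bag as db

def make_generator(n):
    # Hypothetical factory: builds a fresh generator only when called on a worker.
    return (i * i for i in range(n))

sizes = [100000000] * 5  # assumed per-partition sizes

# Nothing is consumed here: each delayed call only records how to build a generator.
delayed_partitions = [delayed(make_generator)(n) for n in sizes]
my_bag = db.from_delayed(delayed_partitions)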

However, there are certainly ways that this can bite you. Some slightly more complex dask bag operations do need to make partitions concrete, which could blow out RAM.
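
As a hedged illustration of that caveat, a shuffle-style operation such as groupby may need to hold whole partitions in memory, whereas a streaming reduction like foldby folds each partition incrementally and then combines the per-partition results:

import dask
import dask.bag as db

b = db.from_delayed([dask.delayed(range)(i) for i in [1000] * 5])

# b.groupby(lambda x: x % 3)  # shuffle-style operations may materialize partitions

# foldby streams through each partition, keeping memory use low.
counts = b.foldby(lambda x: x % 3,         # key function
                  lambda acc, x: acc + 1,  # per-element fold
                  0,                       # initial accumulator
                  lambda a, b: a + b,      # combine partial counts
                  0).compute()
print(counts)  # e.g. [(0, 1670), (1, 1665), (2, 1665)]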
