Problem Description
I'm analyzing ocean temperature data from a climate model simulation where the 4D data arrays (time, depth, latitude, longitude; denoted dask_array below) typically have a shape of (6000, 31, 189, 192) and a size of ~25GB (hence my desire to use dask; I've been getting memory errors trying to process these arrays using numpy).
I need to fit a cubic polynomial along the time axis at each level / latitude / longitude point and store the resulting 4 coefficients. I've therefore set chunksize=(6000, 1, 1, 1) so I have a separate chunk for each grid point.
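(For reference, here is roughly how I build the array — a minimal sketch; the file name ocean_temp.nc and the variable name temp are placeholders for my actual dataset:)

import dask.array as da
import netCDF4

nc = netCDF4.Dataset('ocean_temp.nc')  # placeholder file name
# full time series in each chunk, one chunk per depth/lat/lon point
dask_array = da.from_array(nc.variables['temp'], chunks=(6000, 1, 1, 1))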
This is my function for getting the coefficients of the cubic polynomial (the time_axis axis values are a global 1D numpy array defined elsewhere):
def my_polyfit(data):
    return numpy.polyfit(data.squeeze(), time_axis, 3)
(So in this case, numpy.polyfit returns an array of length 4.)
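(A quick standalone check of that return shape, using synthetic data in place of my real arrays:)

import numpy
time_axis = numpy.arange(6000)      # stand-in for my global time axis
series = numpy.random.random(6000)  # one synthetic grid-point time series
print(numpy.polyfit(series, time_axis, 3).shape)  # prints (4,)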
And this is the command I thought I'd need to apply it to each chunk:
dask_array.map_blocks(my_polyfit, chunks=(4, 1, 1, 1), drop_axis=0, new_axis=0).compute()
Whereby the time axis is now gone (hence drop_axis=0) and there's a new coefficient axis in its place (of length 4).
When I run this command I get IndexError: tuple index out of range, so I'm wondering where/how I've misunderstood the use of map_blocks?
Recommended Answer
I suspect that your experience will be smoother if your function returns an array of the same dimension that it consumes. E.g. you might consider defining your function as follows:
def my_polyfit(data):
    return np.polyfit(data.squeeze(), ...)[:, None, None, None]
Then you can probably omit the new_axis and drop_axis bits.
Performance-wise you might also want to consider using a larger chunksize. At 6000 numbers per chunk you have over a million chunks, which means you'll probably spend more time in scheduling than in actual computation. Generally I shoot for chunks that are a few megabytes in size. Of course, increasing chunksize would cause your mapped function to become more complex.
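As a rough, untested sketch of what that might look like: rechunk so each block holds the full time series for a small tile of grid points (9 divides 189 and 8 divides 192, so every chunk is uniform at roughly 3.5 MB), then fit all the columns of a block in a single np.polyfit call. Note that np.polyfit requires a 1D x, so this version fits the data as a function of time_axis (arguments swapped relative to your my_polyfit); the name my_polyfit_block and the (6000, 1, 9, 8) chunking are just illustrative choices.

import numpy as np

def my_polyfit_block(block):
    # block has shape (6000, 1, 9, 8): the full time series for a tile of points
    nt = block.shape[0]
    flat = block.reshape(nt, -1)  # (time, points)
    # np.polyfit accepts a 2D y and fits each column independently,
    # returning coefficients with shape (4, points)
    coeffs = np.polyfit(time_axis, flat, 3)
    return coeffs.reshape((4,) + block.shape[1:])

rechunked = dask_array.rechunk((6000, 1, 9, 8))
result = rechunked.map_blocks(my_polyfit_block, chunks=(4, 1, 9, 8)).compute()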
In [1]: import dask.array as da

In [2]: import numpy as np

In [3]: def f(b):
   ...:     return np.polyfit(b.squeeze(), np.arange(5), 3)[:, None, None, None]
   ...:

In [4]: x = da.random.random((5, 3, 3, 3), chunks=(5, 1, 1, 1))

In [5]: x.map_blocks(f, chunks=(4, 1, 1, 1)).compute()
Out[5]:
array([[[[ -1.29058580e+02,   2.21410738e+02,   1.00721521e+01],
         [ -2.22469851e+02,  -9.14889627e+01,  -2.86405832e+02],
         [  1.40415805e+02,   3.58726232e+02,   6.47166710e+02]],
        ...