c# - Returning the minimum and maximum elements of an array in CUDA

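I'm performing some array operations/computations in CUDA (through the Cudafy.NET library, though I'm equally interested in CUDA/C++ approaches) and need to compute the minimum and maximum values in an array. One of the kernels looks like this: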

    [Cudafy]
    public static void UpdateEz(GThread thread, float time, float ca, float cb, float[,] hx, float[,] hy, float[,] ez)
    {
        var i = thread.blockIdx.x;
        var j = thread.blockIdx.y;

        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
            ez[i, j] =
                ca * ez[i, j]
                + cb * (hx[i, j] - hx[i - 1, j])
                + cb * (hy[i, j - 1] - hy[i, j])
                ;
    }

I'd like to do something like this:

    [Cudafy]
    public static void UpdateEz(GThread thread, float time, float ca, float cb, float[,] hx, float[,] hy, float[,] ez, out float min, out float max)
    {
        var i = thread.blockIdx.x;
        var j = thread.blockIdx.y;

        min = float.MaxValue;
        max = float.MinValue;

        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
        {
            ez[i, j] =
                ca * ez[i, j]
                + cb * (hx[i, j] - hx[i - 1, j])
                + cb * (hy[i, j - 1] - hy[i, j])
                ;

            min = Math.Min(ez[i, j], min);
            max = Math.Max(ez[i, j], max);

        }
    }

Does anyone know of a convenient way to return the minimum and maximum values (for the whole array, not just per thread or per block)?

Best answer

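Based on your comment on the question, you are trying to find the maximum and minimum while you compute the values. That is possible, but it is not the most efficient approach. If you are set on doing it that way, you can do an atomic comparison against a global minimum and a global maximum. The downside is that every thread will be serialized on those two values, which is likely to be a significant bottleneck.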
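To make the atomic comparison concrete, here is a minimal host-side C++ sketch of the compare-and-swap retry loop it relies on. It is an illustration, not Cudafy or CUDA code: the names `atomic_min` and `atomic_max` are hypothetical, and `std::atomic<float>` stands in for a value in global memory. As far as I know, CUDA's built-in `atomicMin`/`atomicMax` cover integer types only, so a float version on the device would typically use a similar loop built on `atomicCAS`.

```cpp
#include <atomic>

// Hypothetical helper: lowers `target` to `value` if `value` is smaller.
// The loop retries until the swap succeeds or another writer has already
// stored something at least as small.
inline void atomic_min(std::atomic<float>& target, float value) {
    float seen = target.load();
    while (value < seen && !target.compare_exchange_weak(seen, value)) {
        // On failure, `seen` now holds the current value; the condition re-checks it.
    }
}

// Hypothetical helper: raises `target` to `value` if `value` is larger.
inline void atomic_max(std::atomic<float>& target, float value) {
    float seen = target.load();
    while (value > seen && !target.compare_exchange_weak(seen, value)) {
        // Same retry pattern as atomic_min.
    }
}
```

Every writer contends on the same two variables, which is where the serialization comes from. Note also that the maximum should start from the most negative float (`std::numeric_limits<float>::lowest()` in C++, `float.MinValue` in C#), not from `std::numeric_limits<float>::min()`, which is the smallest positive value.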

For the more canonical approach of finding the maximum or minimum of an array via a reduction, you can do something along these lines:

#define MAX_NEG ... //padding value for out-of-range slots; must not exceed any real input element

template <typename T, int BLKSZ> __global__
void cu_max_reduce(const T* d_data, const int d_len, T* max_val)
{
    volatile __shared__ T smem[BLKSZ];

    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    //starting index for each block to begin loading the input data into shared memory
    const int bid_sidx = bid*BLKSZ;

    //load the input data to smem, with padding if needed. each thread handles 2 elements
    #pragma unroll
    for (int i = 0; i < 2; i++)
    {
        //get the index for the thread to load into shared memory
        const int tid_idx = 2*tid + i;
        const int ld_idx = bid_sidx + tid_idx;
        if(ld_idx < (bid+1)*BLKSZ && ld_idx < d_len)
            smem[tid_idx] = d_data[ld_idx];
        else
            smem[tid_idx] = MAX_NEG;

        __syncthreads();
    }

    //run the reduction per-block
    for (unsigned int stride = BLKSZ/2; stride > 0; stride >>= 1)
    {
        if(tid < stride)
        {
            smem[tid] = ((smem[tid] > smem[tid + stride]) ? smem[tid]:smem[tid + stride]);
        }
        __syncthreads();
    }

    //write the per-block result out from shared memory to global memory
    max_val[bid] = smem[0];
}


//assume we have d_data as a device pointer with our data, of length data_len
template <typename T> __host__
T cu_find_max(const T* d_data, const int data_len)
{
    //in your host code, invoke the kernel with something along the lines of:
    const int thread_per_block = 16;
    const int elem_per_thread = 2;
    const int BLKSZ = elem_per_thread*thread_per_block; //number of elements to process per block
    const int blocks_per_grid = (data_len + BLKSZ - 1)/BLKSZ; //integer ceiling division; avoids float rounding for large data_len

    dim3 block_dim(thread_per_block, 1, 1);
    dim3 grid_dim(blocks_per_grid, 1, 1);

    T *d_max;
    cudaMalloc((void **)&d_max, sizeof(T)*blocks_per_grid);

    cu_max_reduce <T, BLKSZ> <<<grid_dim, block_dim>>> (d_data, data_len, d_max);

    //etc....
}

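This finds the maximum value of each block. You can run it again on its own output, on a single block (for example, with d_max as the input data and updated launch parameters), to find the global maximum. Running it in multiple passes like this is necessary if your data set is too large (in this case, more than 2 * 4096 elements, since each thread processes 2 elements, although you could have each thread process more elements to raise this limit).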
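The multi-pass idea can be modeled on the host to see how the passes chain together. The sketch below is plain C++, not device code: `max_reduce_pass` and `multi_pass_max` are hypothetical names, each call to `max_reduce_pass` plays the role of one kernel launch, and `std::numeric_limits<float>::lowest()` is an assumed stand-in for `MAX_NEG`.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// One pass, modeling one kernel launch: every "block" of `blksz` elements is
// padded, tree-reduced with a halving stride, and emits a single maximum.
// Assumes blksz is a power of two (the halving loop needs it) and blksz >= 2.
std::vector<float> max_reduce_pass(const std::vector<float>& in, std::size_t blksz) {
    const float pad = std::numeric_limits<float>::lowest(); // assumed MAX_NEG
    const std::size_t blocks = (in.size() + blksz - 1) / blksz;
    std::vector<float> out(blocks);
    std::vector<float> smem(blksz); // stands in for the block's shared memory
    for (std::size_t b = 0; b < blocks; ++b) {
        for (std::size_t k = 0; k < blksz; ++k) {
            const std::size_t idx = b * blksz + k;
            smem[k] = idx < in.size() ? in[idx] : pad;
        }
        for (std::size_t stride = blksz / 2; stride > 0; stride >>= 1)
            for (std::size_t tid = 0; tid < stride; ++tid)
                smem[tid] = std::max(smem[tid], smem[tid + stride]);
        out[b] = smem[0];
    }
    return out;
}

// Feeds each pass's output into the next until one value remains.
// Assumes `data` is nonempty.
float multi_pass_max(std::vector<float> data, std::size_t blksz) {
    while (data.size() > 1)
        data = max_reduce_pass(data, blksz);
    return data.front();
}
```

With `blksz = 32`, an input of 1000 elements shrinks to 32 partial maxima after the first pass and to a single value after the second. On the device the same loop would relaunch the kernel with the previous pass's output buffer as its input.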

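I should point out that this isn't particularly efficient (you would want smarter striding when loading into shared memory, to avoid bank conflicts), and I'm not 100% sure it's correct (it worked on the few small test cases I tried), but I tried to write it for maximum clarity. Also, don't forget to add some error-checking code to make sure your CUDA calls complete successfully; I left it out here to keep things short(er).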
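One way to read the bank-conflict remark is to count how many threads of a warp land on the same shared-memory bank under each indexing scheme. The sketch below is a host-side model under stated assumptions: 32 banks of 4-byte words, 32 threads per warp, and 4-byte elements. `strided_index` reproduces the kernel's `2*tid + i`; `interleaved_index` is one possible alternative, not necessarily what the answer's author had in mind.

```cpp
#include <algorithm>
#include <array>

constexpr int kBanks = 32;    // assumption: 32 shared-memory banks of 4-byte words
constexpr int kWarpSize = 32; // assumption: 32 threads per warp

// Index used by the kernel on load iteration i: neighbouring threads are 2 words apart.
inline int strided_index(int tid, int i) { return 2 * tid + i; }

// Alternative: neighbouring threads touch neighbouring words, and iteration i
// shifts the whole group by the number of threads in the block.
inline int interleaved_index(int tid, int i, int threads) { return tid + i * threads; }

// Largest number of threads in one warp that hit the same bank; 1 means conflict-free.
// Counting hits is valid here because every thread touches a different word.
template <typename IndexFn>
int conflict_degree(IndexFn index_of) {
    std::array<int, kBanks> hits{};
    for (int tid = 0; tid < kWarpSize; ++tid)
        ++hits[index_of(tid) % kBanks];
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions the `2*tid + i` pattern gives a two-way conflict for a full warp, while the interleaved pattern gives none. With only 16 threads per block, as in the launch code above, the strided pattern stays within 32 distinct words, so the conflict would only show up at larger block sizes.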

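I should also point you to some more in-depth documentation: take a look at the reduction example in the CUDA samples at http://docs.nvidia.com/cuda/cuda-samples/index.html. It doesn't do a min/max calculation, but it is the same general idea (and more efficient). Also, if you are looking for simplicity, you might just want to use Thrust's functions thrust::max_element and thrust::min_element, documented at: thrust.github.com/doc/group__extrema.html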
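Whichever device-side route you take, a plain host reference is handy for checking the result on small inputs. The sketch below uses only the C++ standard library; `host_min_max` is a hypothetical helper name. As far as I know, `thrust::min_element` and `thrust::max_element` follow the same calling pattern as their `std::` counterparts (a range goes in, an iterator to the extreme element comes out), so the shape carries over.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical reference helper: returns {min, max} of a nonempty vector.
std::pair<float, float> host_min_max(const std::vector<float>& data) {
    // minmax_element returns a pair of iterators: first -> smallest, second -> largest.
    const auto its = std::minmax_element(data.begin(), data.end());
    return {*its.first, *its.second};
}
```

Copying the `ez` array back to the host and comparing against this helper is a cheap way to validate a reduction kernel before trusting it on large grids.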

Regarding "c# - Returning the minimum and maximum elements of an array in CUDA", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15747519/