This article walks through a basic CUDA reduction problem; the question and accepted answer below may be a useful reference for readers running into the same issue.

Problem description

I'm trying to sum an array with this code and I am stuck. I probably need a "CUDA for dummies" tutorial, because I have spent so much time on such a basic operation and I can't make it work.

Here is a list of things I don't understand or am unsure of:


  1. What number of blocks (dimGrid) should I use? I think it should be N/dimBlock.x/2 (N = length of the input array), because at the beginning of the kernel, data are loaded and added to shared memory from two "blocks" of global memory.

  2. In the original code there was blockSize. I replaced it with blockDim.x because I don't know how these variables differ. But when blockSize = blockDim.x, then gridSize = blockDim.x*2*gridDim.x doesn't make sense to me - gridSize will be greater than N. What is the difference between *Dim.x and *Size in the context of a 1D array? (A sketch of the original templated kernel, which shows how blockSize is used, appears further below.)

  3. Main logic - in the kernel, each block sums 2*dimBlock (threads per block) numbers. When N = 262144 and dimBlock = 128, the kernel returns an array of 1024 partial sums. Then I run the kernel again and get 4 partial sums. Finally, in the last run, a single sum is returned, because the array is processed by a single block.

I sum a binary array. In the first run, I can use uchar4 for the input data. In the second and third runs, I will use int.
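For context, the kernel in the question appears to be adapted from the reduction kernel in NVIDIA's "Optimizing Parallel Reduction in CUDA" material, where blockSize is a compile-time template parameter carrying the same value as blockDim.x, and gridSize is a stride rather than an element count, so it may legitimately exceed N. A rough sketch of that pattern, simplified to int input, is shown here for illustration only; it is not a drop-in replacement for the code below.

template <unsigned int blockSize>
__global__ void reduce_templated(const int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    // gridSize is a stride, not an element count, so it can exceed n.
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    int sum = 0;
    while (i < n) {                          // grid-stride loop
        sum += g_idata[i];
        if (i + blockSize < n)               // guard the second read near the end of the array
            sum += g_idata[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // Because blockSize is a compile-time constant, the branches below that do
    // not apply to the chosen block size are removed by the compiler.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }

    if (tid < 32) {                          // final warp, volatile pointer instead of __syncthreads()
        volatile int *s = sdata;
        if (blockSize >= 64) s[tid] += s[tid + 32];
        if (blockSize >= 32) s[tid] += s[tid + 16];
        if (blockSize >= 16) s[tid] += s[tid +  8];
        if (blockSize >=  8) s[tid] += s[tid +  4];
        if (blockSize >=  4) s[tid] += s[tid +  2];
        if (blockSize >=  2) s[tid] += s[tid +  1];
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Launched with the template argument equal to the block size, e.g.
//   reduce_templated<128><<<dimGrid, 128, 128 * sizeof(int)>>>(d_in, d_out, n);

Making blockSize a compile-time constant is what lets the compiler strip the reduction steps that do not apply to the chosen block size; that is the only reason it exists alongside blockDim.x.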

Please tell me what I'm missing.

Thanks.

__global__ void sum_reduction(uchar4* g_idata, int* g_odata, int N) {

    extern __shared__ int s_data[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + tid;
    unsigned int gridSize = blockDim.x*2*gridDim.x;

    s_data[tid] = 0;  // shared memory is not zero-initialized, so clear the accumulator first

    // Grid-stride loop: each iteration adds the four channels of two uchar4
    // elements, one blockDim.x apart (assumes N is a multiple of 2*blockDim.x).
    while (i < N) {
        s_data[tid] += g_idata[i].x + g_idata[i+blockDim.x].x +
                g_idata[i].y + g_idata[i+blockDim.x].y +
                g_idata[i].z + g_idata[i+blockDim.x].z +
                g_idata[i].w + g_idata[i+blockDim.x].w;
        i += gridSize;
    }
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x == 128).
    if (tid < 64) {
        s_data[tid] += s_data[tid + 64];
    }
    __syncthreads();

    // Final warp: no __syncthreads() needed, but the pointer must be volatile.
    if (tid < 32) {
        volatile int *s_ptr = s_data;
        s_ptr[tid] += s_ptr[tid + 32];
        s_ptr[tid] += s_ptr[tid + 16];
        s_ptr[tid] += s_ptr[tid + 8];
        s_ptr[tid] += s_ptr[tid + 4];
        s_ptr[tid] += s_ptr[tid + 2];
        s_ptr[tid] += s_ptr[tid + 1];
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) {
        g_odata[blockIdx.x] = s_data[0];
    }
}


int main() {
    ...
    dim3 dimBlock(128);
    dim3 dimGrid(N/dimBlock.x);
    sum_reduction<<<dimGrid, dimBlock>>>(in, out, N);
    ...
}


Accepted answer

Calling the kernel like this fixes the problem. The kernel declares its shared memory as extern __shared__ int s_data[], so the size of that buffer must be supplied as the third argument of the <<< >>> launch configuration; the original launch omitted it, leaving the block with no dynamic shared memory.

dim3 dimBlock(128);
dim3 dimGrid(N/dimBlock.x);
int smemSize = dimBlock.x * sizeof(int);   // dynamic shared memory: one int per thread
sum_reduction<<<dimGrid, dimBlock, smemSize>>>(in, out, N);
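For completeness, a hypothetical host-side driver for the three-pass scheme described in the question might look like the sketch below. It only illustrates the launch bookkeeping under the question's assumptions (block size 128, N a multiple of 256); sum_reduction_int is an assumed int-input variant of the kernel above, and the buffer names are made up for illustration.

#include <cuda_runtime.h>

// Hypothetical driver: pass 1 reduces the uchar4 input, later passes reduce
// the int partial sums until a single value remains.
int reduce_all(uchar4 *d_in, int N) {
    const int block = 128;
    int grid = (N + 2 * block - 1) / (2 * block);   // each block consumes 2*block input elements
    size_t smem = block * sizeof(int);              // third launch argument: dynamic shared memory, in bytes

    int *d_a, *d_b;
    cudaMalloc(&d_a, grid * sizeof(int));
    cudaMalloc(&d_b, grid * sizeof(int));

    // Pass 1: uchar4 input -> 'grid' partial sums.
    sum_reduction<<<grid, block, smem>>>(d_in, d_a, N);

    // Further passes: keep reducing the int partial sums.
    int n = grid;
    int *src = d_a, *dst = d_b;
    while (n > 1) {
        int g = (n + 2 * block - 1) / (2 * block);
        sum_reduction_int<<<g, block, smem>>>(src, dst, n);  // assumed int variant (needs a bounds
                                                             // check on the i+blockDim.x read for small n)
        n = g;
        int *tmp = src; src = dst; dst = tmp;                // ping-pong buffers
    }

    int result;
    cudaMemcpy(&result, src, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}

Note that the first-pass grid size comes out to N/(2*dimBlock.x), i.e. the N/dimBlock.x/2 the question reasons about, since each block consumes 2*dimBlock.x input elements per pass.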

That concludes this article on CUDA reduction - basics. We hope the answer above is helpful, and thank you for your support!
