This article looks at a question about CUDA atomicAdd() producing wrong results, along with the answer explaining why it happens.

Problem Description

I am a CUDA newbie, playing with CUDA kernels for the first time. I've got the following kernel that implements convolution (very naively), with a dummy loop that performs a calculation on the same element 1000 times in global memory (see below). The problem is that after the operation, some cells in the result matrix are wrong: starting at a certain offset, the values are not a multiple of 1000 as one would expect.

My kernel:

__global__ void conv(float *input, float *kernel, float *target)
{
    // One block per output pixel, one thread per kernel tap;
    // the dummy loop repeats the same accumulation 100 times.
    for (long i = 0; i < 100; i++)
    {
        atomicAdd(target + gridDim.y * blockIdx.x + blockIdx.y,
                  input[(blockIdx.x + threadIdx.x) * (blockDim.y + gridDim.y - 1)
                        + (blockIdx.y + threadIdx.y)]
                  * kernel[threadIdx.x * blockDim.y + threadIdx.y]);
    }
}

The code invoking the kernel:

float image[1024] = {0.0};
float kernel[] =
{
    1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
    1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
    1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
    1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
    1.0f, 1.0f, 1.0f, 1.0f, 1.0f
};

float res[784]={0};

for (int i = 0; i < 1024; i++)
{
    image[i]=(float)i;
} // Got 32x32 matrix

cudaError_t cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
    exit (-1);
}

float *dev_image = 0;
float *dev_kernel = 0;
float *dev_res = 0;

// Allocate GPU buffers for three vectors (two input, one output).
cudaStatus = cudaMalloc((void**)&dev_image, sizeof(image));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    exit(-10);
}

cudaStatus = cudaMalloc((void**)&dev_kernel, sizeof(kernel));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    exit(-10);
}

cudaStatus = cudaMalloc((void**)&dev_res, sizeof(res));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    exit(-10);
}

cudaMemcpy(dev_image, image, sizeof(image), cudaMemcpyHostToDevice);
cudaMemcpy(dev_kernel, kernel, sizeof(kernel), cudaMemcpyHostToDevice);

cudaMemset(dev_res,0,sizeof(res));

// Convolving a 32x32 matrix with a 5x5 kernel, getting a 28x28 matrix as a result
dim3 blocks(28,28,1);
dim3 threads(5,5,1);

for (int itr = 0; itr<10; itr++)
{
    conv<<<blocks, threads>>>(dev_image,dev_kernel, dev_res);
}

cudaMemcpy(res, dev_res, sizeof(res), cudaMemcpyDeviceToHost);

printf("res[0]=%f\n",res[0]);

cudaFree(dev_kernel);
cudaFree(dev_image);
cudaFree(dev_res);

exit (0);

It seems that I have handled the concurrency issue, so that shouldn't be the root cause. I appreciate any help.

Recommended Answer

You're doing arbitrary arithmetic on float values and expecting perfect accuracy.

float values can store integers exactly only up to the width of the mantissa (2^24, i.e. 16777216, for IEEE 754 single precision). Once accumulated values exceed that, float operations begin to become imprecise. Naturally, the values in your result that accumulate to the largest numbers (those towards the end of the res array) will show this effect first.
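To see the limit in isolation, here is a minimal host-side C sketch (no CUDA involved):

#include <stdio.h>

int main(void)
{
    // 16777216 == 2^24: the last point at which every integer
    // is exactly representable as a float.
    float f = 16777216.0f;
    printf("%.1f\n", f + 1.0f);   // prints 16777216.0 -- the +1 is lost
    return 0;
}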

Let's call the product of the loop count in your kernel and the loop count in your host code around the kernel total_loops. For a total_loops value up to around 700, I get "precise" results, that is, all results are evenly divisible by total_loops. After that, as you gradually increase total_loops, the errors start to creep in, starting at the end of the res array.
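A back-of-the-envelope check (my own arithmetic, under the assumption that imprecision sets in once the running sum leaves the exactly-representable integer range) matches that threshold. The largest output cell, res[783], sums the bottom-right 5x5 window of the image, which adds up to 23925, and that stays below 2^24 only while total_loops is at most 701:

#include <stdio.h>

int main(void)
{
    // Sum of image values in the bottom-right 5x5 window:
    // image[i*32+j] == i*32+j for i, j in 27..31.
    long window_sum = 0;
    for (int i = 27; i < 32; i++)
        for (int j = 27; j < 32; j++)
            window_sum += i * 32 + j;         // window_sum == 23925

    // Largest total_loops for which the final value still fits
    // below 2^24 and therefore accumulates exactly in a float.
    long limit = (1L << 24) / window_sum;     // == 701
    printf("window_sum = %ld, exact up to total_loops = %ld\n",
           window_sum, limit);
    return 0;
}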

You could switch to double instead of float and your results would be different, except that a version of atomicAdd for double isn't conveniently available as a built-in. However, the CUDA C Programming Guide shows how to create arbitrary atomic operations, and the example it gives happens to implement atomicAdd for double.

So the following modification of your code allows you to explore both ideas:


  • If you want to see how double fixes the issue, change the define from USE_FLOAT to USE_DOUBLE.
  • Instead, if you want to see how reducing total_loops fixes the issue, change the LOOPS1 define from 100 to 70.
  • I would also mention that it's good practice to do cuda error checking on all API calls and kernel calls (you're only covering a few, and not the kernel), but it's not an issue in this case.

Here is the code:

#include <stdio.h>
#define LOOPS1 100
#define LOOPS2 10
// set to USE_DOUBLE or USE_FLOAT
#define USE_FLOAT

#ifndef USE_DOUBLE
typedef float mytype;
#else
typedef double mytype;
#endif

// atomicAdd for double, as shown in the CUDA C Programming Guide:
// emulate the atomic add with an atomicCAS loop on the 64-bit pattern.
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Try to swap in (assumed + val); retry if another thread
        // changed the value in the meantime.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

__global__ void conv(mytype *input, mytype *kernel, mytype *target)
{
    for (long i = 0; i < LOOPS1; i++)
    {
        atomicAdd(target + gridDim.y * blockIdx.x + blockIdx.y,
                  input[(blockIdx.x + threadIdx.x) * (blockDim.y + gridDim.y - 1)
                        + (blockIdx.y + threadIdx.y)]
                  * kernel[threadIdx.x * blockDim.y + threadIdx.y]);
    }
}

int main(){

mytype image[1024] = {0.0};
mytype kernel[] =
{
    1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
    1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
    1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
    1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
    1.0f, 1.0f, 1.0f, 1.0f, 1.0f
};

mytype res[784]={0};

for (int i = 0; i < 1024; i++)
{
    image[i]=(mytype)i;
} // Got 32x32 matrix

cudaError_t cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
    exit (-1);
}

mytype *dev_image = 0;
mytype *dev_kernel = 0;
mytype *dev_res = 0;

// Allocate GPU buffers for three vectors (two input, one output).
cudaStatus = cudaMalloc((void**)&dev_image, sizeof(image));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    exit(-10);
}

cudaStatus = cudaMalloc((void**)&dev_kernel, sizeof(kernel));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    exit(-10);
}

cudaStatus = cudaMalloc((void**)&dev_res, sizeof(res));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed!");
    exit(-10);
}

cudaMemcpy(dev_image, image, sizeof(image), cudaMemcpyHostToDevice);
cudaMemcpy(dev_kernel, kernel, sizeof(kernel), cudaMemcpyHostToDevice);

cudaMemset(dev_res,0,sizeof(res));

// Convolving a 32x32 matrix with a 5x5 kernel, getting a 28x28 matrix as a result
dim3 blocks(28,28,1);
dim3 threads(5,5,1);

for (int itr = 0; itr<LOOPS2; itr++)
{
    conv<<<blocks, threads>>>(dev_image,dev_kernel, dev_res);
}

cudaMemcpy(res, dev_res, sizeof(res), cudaMemcpyDeviceToHost);

printf("results:\n");
for (int i = 0; i< (28*28); i++)
  if ((((int)res[i])%(LOOPS1*LOOPS2)) != 0) {printf("first error index: %d, value: %f\n", i, res[i]); return 1;}

cudaFree(dev_kernel);
cudaFree(dev_image);
cudaFree(dev_res);

  return 0;
}
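One caveat when building this today: on devices of compute capability 6.0 and higher, CUDA 8.0 and later toolkits already provide a native atomicAdd(double*, double), and the hand-rolled definition above then clashes with the built-in at compile time. The usual fix, following the pattern the programming guide uses (a sketch, not part of the original answer), is to compile the fallback only for older architectures:

// Provide the CAS-based fallback only on architectures without a native
// double atomicAdd (compute capability below 6.0).
#if __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // integer compare also avoids a hang on NaN
    return __longlong_as_double(old);
}
#endif

With that guard in place, something like nvcc -o conv conv.cu should build across architectures.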

Note that even if you use double, the problem will eventually show up again if you accumulate to large enough values; with IEEE 754 double precision the exact-integer range simply extends to 2^53 instead of 2^24.
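The same one-line demonstration as before, now at double precision (a minimal host-side sketch):

#include <stdio.h>

int main(void)
{
    // Exact integers run out at 2^53 for a double.
    double d = 9007199254740992.0;   // 2^53
    printf("%.1f\n", d + 1.0);       // prints 9007199254740992.0 again
    return 0;
}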

Also note that this isn't really a CUDA/GPU issue: float in host code has the same limitation.
