This article looks at a CUDA question: understanding the behavior of a variable held in the register file across a loop, using a dot product example. It may serve as a useful reference for anyone facing the same problem.

Problem Description

I am very new to CUDA programming. Currently I am having difficulty understanding the behavior of the following program, which computes the dot product of two vectors.

The dot product kernel, dotProd, computes the element-wise products and reduces the results to a shorter vector of length blockDim.x * gridDim.x. The results in the vector *out are then copied back to the host for further reduction.

The second version, dotProdWithSharedMem, is copied from the CUDA By Example book, see here.

My questions are:


  1. When the kernel is launched with enough threads (nThreadsPerBlock * nblocks >= vector_length), the result of dotProd matches the one computed on the CPU, but the result of dotProdWithSharedMem differs from both. What could be the possible causes? A possible output of $ dot_prod.o 17 512:



    Number of threads per block : 256
    Number of blocks in the grid: 512
    Total number of threads     : 131072
    Length of vectors           : 131072

    GPU using registers: 9.6904191971, time consumed: 0.56154 ms
    GPU using shared   : 9.6906833649, time consumed: 0.04473 ms
    CPU result         : 9.6904191971, time consumed: 0.28504 ms




  2. When the kernel is launched with not enough threads (nThreadsPerBlock * nblocks < vector_length), the GPU results seem to be less accurate. However, the while loop is supposed to handle this problem. I guess something might be happening to the register variable temp in the loop; otherwise the result should remain the same as in question 1. A possible output of $ dot_prod.o 17 256:



    Number of threads per block : 256
    Number of blocks in the grid: 256
    Total number of threads     : 65536
    Length of vectors           : 131072

    GPU using registers: 9.6906890869, time consumed: 0.31478 ms
    GPU using shared   : 9.6906604767, time consumed: 0.03530 ms
    CPU result         : 9.6904191971, time consumed: 0.28404 ms




  3. I don't quite understand the size of the cache in dotProdWithSharedMem. Why is it nThreadsPerBlock elements rather than the total number of threads, nThreadsPerBlock * nblocks? I think that should be the right number of temp values; is this correct?

Code:

#include <cstdio>
#include <iostream>
#include <string>
#include <cmath>
#include <chrono>
#include <cuda.h>


#define PI (float) 3.141592653589793

const size_t nThreadsPerBlock = 256;


static void HandleError(cudaError_t err, const char *file, int line)
{
    if (err != cudaSuccess) {
        printf("%s in %s at line %d\n", cudaGetErrorString(err),
               file, line);
        exit(EXIT_FAILURE);
    }
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))


__global__ void dotProd(int length, float *u, float *v, float *out) {
    unsigned tid = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned tid_const = threadIdx.x + blockDim.x * blockIdx.x; // fixed output slot for this thread
    float temp = 0;

    // Grid-stride loop: each thread accumulates every (blockDim.x * gridDim.x)-th product
    while (tid < length) {
        temp += u[tid] * v[tid];
        tid  += blockDim.x * gridDim.x;
    }
    out[tid_const] = temp; // one partial sum per thread
}


__global__ void dotProdWithSharedMem(int length, float *u, float *v, float *out) {
    __shared__ float cache[nThreadsPerBlock]; // one slot per thread of this block
    unsigned tid = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned cid = threadIdx.x;

    // Grid-stride loop, as in dotProd
    float temp = 0;
    while (tid < length) {
        temp += u[tid] * v[tid];
        tid  += blockDim.x * gridDim.x;
    }

    cache[cid] = temp;
    __syncthreads();

    // Pairwise (tree) reduction in shared memory: halve the active range each pass
    int i = blockDim.x/2;
    while (i != 0) {
        if (cid < i) {
            cache[cid] += cache[cid + i];
        }
        __syncthreads();
        i /= 2;
    }

    if (cid == 0) {
        out[blockIdx.x] = cache[0]; // one partial sum per block
    }
}


int main(int argc, char* argv[]) {

    size_t vec_len  = 1 << std::stoi(argv[1]);
    size_t size     = vec_len * sizeof(float);
    size_t nblocks  = std::stoi(argv[2]);
    size_t size_out   = nThreadsPerBlock*nblocks*sizeof(float);
    size_t size_out_2 = nblocks*sizeof(float);

    float *u     = (float *)malloc(size);
    float *v     = (float *)malloc(size);
    float *out   = (float *)malloc(size_out);
    float *out_2 = (float *)malloc(size_out_2);

    float *dev_u, *dev_v, *dev_out, *dev_out_2; // Device arrays

    float res_gpu = 0;
    float res_gpu_2 = 0;
    float res_cpu = 0;

    dim3 dimGrid(nblocks, 1, 1);
    dim3 dimBlocks(nThreadsPerBlock, 1, 1);

    // Initiate values
    for(size_t i=0; i<vec_len; ++i) {
        u[i] = std::sin(i*PI*1E-2);
        v[i] = std::cos(i*PI*1E-2);
    }

    HANDLE_ERROR( cudaMalloc((void**)&dev_u, size) );
    HANDLE_ERROR( cudaMalloc((void**)&dev_v, size) );
    HANDLE_ERROR( cudaMalloc((void**)&dev_out, size_out) );
    HANDLE_ERROR( cudaMalloc((void**)&dev_out_2, size_out_2) );
    HANDLE_ERROR( cudaMemcpy(dev_u, u, size, cudaMemcpyHostToDevice) );
    HANDLE_ERROR( cudaMemcpy(dev_v, v, size, cudaMemcpyHostToDevice) );


    auto t1_gpu = std::chrono::system_clock::now();
    dotProd <<<dimGrid, dimBlocks>>> (vec_len, dev_u, dev_v, dev_out);
    cudaDeviceSynchronize();
    HANDLE_ERROR( cudaMemcpy(out, dev_out, size_out, cudaMemcpyDeviceToHost) );
    // Reduction
    for(size_t i=0; i<nThreadsPerBlock*nblocks; ++i) {
        res_gpu += out[i];
    }


    auto t2_gpu = std::chrono::system_clock::now();
    // GPU version with shared memory
    dotProdWithSharedMem <<<dimGrid, dimBlocks>>> (vec_len, dev_u, dev_v, dev_out_2);
    cudaDeviceSynchronize();
    HANDLE_ERROR( cudaMemcpy(out_2, dev_out_2, size_out_2, cudaMemcpyDeviceToHost) );
    // Reduction
    for(size_t i=0; i<nblocks; ++i) {
        res_gpu_2 += out_2[i];
    }
    auto t3_gpu = std::chrono::system_clock::now();


    // CPU version for result-check
    for(size_t i=0; i<vec_len; ++i) {
        res_cpu += u[i] * v[i];
    }
    auto t2_cpu = std::chrono::system_clock::now();


    double t_gpu = std::chrono::duration <double, std::milli> (t2_gpu - t1_gpu).count();
    double t_gpu_2 = std::chrono::duration <double, std::milli> (t3_gpu - t2_gpu).count();
    double t_cpu = std::chrono::duration <double, std::milli> (t2_cpu - t3_gpu).count();

    printf("Number of threads per block : %i \n", nThreadsPerBlock);
    printf("Number of blocks in the grid: %i \n", nblocks);
    printf("Total number of threads     : %i \n", nThreadsPerBlock*nblocks);
    printf("Length of vectors           : %i \n\n", vec_len);
    printf("GPU using registers: %.10f, time consummed: %.5f ms\n", res_gpu, t_gpu);
    printf("GPU using shared   : %.10f, time consummed: %.5f ms\n", res_gpu_2, t_gpu_2);
    printf("CPU result         : %.10f, time consummed: %.5f ms\n", res_cpu, t_cpu);

    cudaFree(dev_u);
    cudaFree(dev_v);
    cudaFree(dev_out);
    cudaFree(dev_out_2);
    free(u);
    free(v);
    free(out);
    free(out_2);

    return 0;
}
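
For reference, this is how the program can be built and run (assuming the source file is saved as dot_prod.cu; the .o suffix on the executable is just my naming habit):

$ nvcc -std=c++11 dot_prod.cu -o dot_prod.o
$ ./dot_prod.o 17 512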

Thank you for your patience in reading this LONG post! Any help will be deeply appreciated!

Niko

Recommended Answer

You're exploring the limits of float precision combined with the variation associated with floating-point order of operations. The actual "accuracy" here will depend on the exact data and the exact order of operations. Different algorithms will have different orders of operations, and therefore different results.
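
To make the order-of-operations point concrete, here is a minimal host-side sketch (my illustration, not from the question; the constants are chosen only to expose the effect) showing that float addition is not associative:

#include <cstdio>

int main() {
    // 1.0e8 is exactly representable as a float, but the spacing between
    // adjacent floats at that magnitude is 8, so -1.0e8f + 1.0f rounds
    // back to -1.0e8f and the grouping changes the final sum.
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    printf("(a + b) + c = %.1f\n", (a + b) + c); // prints 1.0
    printf("a + (b + c) = %.1f\n", a + (b + c)); // prints 0.0
    return 0;
}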

You may want to read up on floating-point arithmetic.

One of the assumptions you seem to be making is that the CPU result is the accurate one, without any justification for that assumption.

If we define "accuracy" as the difference (i.e. "closeness") between the result and the numerically correct result, I suspect that the shared memory result is the more accurate one.
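
As a sketch of why that is plausible, the following host-only comparison (my own illustration, not part of either code listing; it mimics the sin*cos products from the question) contrasts sequential float summation, pairwise float summation shaped like the shared-memory tree reduction, and a double reference:

#include <cstdio>
#include <cstddef>
#include <cmath>
#include <vector>

// Left-to-right accumulation, like the CPU loop in the question.
float sumSequential(const std::vector<float>& x) {
    float s = 0.0f;
    for (float v : x) s += v;
    return s;
}

// Pairwise (tree) accumulation, the same shape as the shared-memory reduction.
float sumPairwise(const std::vector<float>& x, size_t lo, size_t hi) {
    if (hi - lo == 1) return x[lo];
    size_t mid = lo + (hi - lo) / 2;
    return sumPairwise(x, lo, mid) + sumPairwise(x, mid, hi);
}

int main() {
    const size_t n = 1 << 17;
    std::vector<float> x(n);
    for (size_t i = 0; i < n; ++i)
        x[i] = (float)(std::sin(i * 3.141592653589793 * 1e-2) *
                       std::cos(i * 3.141592653589793 * 1e-2));

    // Accumulate in double as a (more) trustworthy reference.
    double ref = 0.0;
    for (float v : x) ref += (double)v;

    printf("sequential float : %.10f\n", sumSequential(x));
    printf("pairwise   float : %.10f\n", sumPairwise(x, 0, n));
    printf("double reference : %.10f\n", ref);
    return 0;
}

The intuition is that the rounding error of a sequential sum grows roughly linearly with the number of terms, while a pairwise sum only accumulates error over the depth of the tree, which is logarithmic in the number of terms.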

If we convert your code to use double type instead of float type, we observe that:


  1. The results of all 3 approaches are much closer (identical in the printout).
  2. The double results don't match any of the float cases.
  3. The shared memory result from the float case is actually the closest to the double case results.

Here's a test case demonstrating this:

$ cat t397.cu
#include <cstdio>
#include <iostream>
#include <string>
#include <cmath>
#include <chrono>
#include <cuda.h>

#ifndef USE_DOUBLE
typedef float ft;
#else
typedef double ft;
#endif
#define PI (ft) 3.141592653589793

const size_t nThreadsPerBlock = 256;


static void HandleError(cudaError_t err, const char *file, int line)
{
    if (err != cudaSuccess) {
        printf("%s in %s at line %d\n", cudaGetErrorString(err),
               file, line);
        exit(EXIT_FAILURE);
    }
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))


__global__ void dotProd(int length, ft *u, ft *v, ft *out) {
    unsigned tid = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned tid_const = threadIdx.x + blockDim.x * blockIdx.x;
    ft temp = 0;

    while (tid < length) {
        temp += u[tid] * v[tid];
        tid  += blockDim.x * gridDim.x;
    }
    out[tid_const] = temp;
}


__global__ void dotProdWithSharedMem(int length, ft *u, ft *v, ft *out) {
    __shared__ ft cache[nThreadsPerBlock];
    unsigned tid = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned cid = threadIdx.x;

    ft temp = 0;
    while (tid < length) {
        temp += u[tid] * v[tid];
        tid  += blockDim.x * gridDim.x;
    }

    cache[cid] = temp;
    __syncthreads();

    int i = blockDim.x/2;
    while (i != 0) {
        if (cid < i) {
            cache[cid] += cache[cid + i];
        }
        __syncthreads();
        i /= 2;
    }

    if (cid == 0) {
        out[blockIdx.x] = cache[0];
    }
}


int main(int argc, char* argv[]) {

    size_t vec_len  = 1 << std::stoi(argv[1]);
    size_t size     = vec_len * sizeof(ft);
    size_t nblocks  = std::stoi(argv[2]);
    size_t size_out   = nThreadsPerBlock*nblocks*sizeof(ft);
    size_t size_out_2 = nblocks*sizeof(ft);

    ft *u     = (ft *)malloc(size);
    ft *v     = (ft *)malloc(size);
    ft *out   = (ft *)malloc(size_out);
    ft *out_2 = (ft *)malloc(size_out_2);

    ft *dev_u, *dev_v, *dev_out, *dev_out_2; // Device arrays

    ft res_gpu = 0;
    ft res_gpu_2 = 0;
    ft res_cpu = 0;

    dim3 dimGrid(nblocks, 1, 1);
    dim3 dimBlocks(nThreadsPerBlock, 1, 1);

    // Initiate values
    for(size_t i=0; i<vec_len; ++i) {
        u[i] = std::sin(i*PI*1E-2);
        v[i] = std::cos(i*PI*1E-2);
    }

    HANDLE_ERROR( cudaMalloc((void**)&dev_u, size) );
    HANDLE_ERROR( cudaMalloc((void**)&dev_v, size) );
    HANDLE_ERROR( cudaMalloc((void**)&dev_out, size_out) );
    HANDLE_ERROR( cudaMalloc((void**)&dev_out_2, size_out_2) );
    HANDLE_ERROR( cudaMemcpy(dev_u, u, size, cudaMemcpyHostToDevice) );
    HANDLE_ERROR( cudaMemcpy(dev_v, v, size, cudaMemcpyHostToDevice) );


    auto t1_gpu = std::chrono::system_clock::now();
    dotProd <<<dimGrid, dimBlocks>>> (vec_len, dev_u, dev_v, dev_out);
    cudaDeviceSynchronize();
    HANDLE_ERROR( cudaMemcpy(out, dev_out, size_out, cudaMemcpyDeviceToHost) );
    // Reduction
    for(size_t i=0; i<nThreadsPerBlock*nblocks; ++i) {
        res_gpu += out[i];
    }


    auto t2_gpu = std::chrono::system_clock::now();
    // GPU version with shared memory
    dotProdWithSharedMem <<<dimGrid, dimBlocks>>> (vec_len, dev_u, dev_v, dev_out_2);
    cudaDeviceSynchronize();
    HANDLE_ERROR( cudaMemcpy(out_2, dev_out_2, size_out_2, cudaMemcpyDeviceToHost) );
    // Reduction
    for(size_t i=0; i<nblocks; ++i) {
        res_gpu_2 += out_2[i];
    }
    auto t3_gpu = std::chrono::system_clock::now();


    // CPU version for result-check
    for(size_t i=0; i<vec_len; ++i) {
        res_cpu += u[i] * v[i];
    }
    auto t2_cpu = std::chrono::system_clock::now();


    double t_gpu = std::chrono::duration <double, std::milli> (t2_gpu - t1_gpu).count();
    double t_gpu_2 = std::chrono::duration <double, std::milli> (t3_gpu - t2_gpu).count();
    double t_cpu = std::chrono::duration <double, std::milli> (t2_cpu - t3_gpu).count();

    printf("Number of threads per block : %i \n", nThreadsPerBlock);
    printf("Number of blocks in the grid: %i \n", nblocks);
    printf("Total number of threads     : %i \n", nThreadsPerBlock*nblocks);
    printf("Length of vectors           : %i \n\n", vec_len);
    printf("GPU using registers: %.10f, time consummed: %.5f ms\n", res_gpu, t_gpu);
    printf("GPU using shared   : %.10f, time consummed: %.5f ms\n", res_gpu_2, t_gpu_2);
    printf("CPU result         : %.10f, time consummed: %.5f ms\n", res_cpu, t_cpu);

    cudaFree(dev_u);
    cudaFree(dev_v);
    cudaFree(dev_out);
    cudaFree(dev_out_2);
    free(u);
    free(v);
    free(out);
    free(out_2);

    return 0;
}
$ nvcc -std=c++11 t397.cu -o t397
$ ./t397 17 512
Number of threads per block : 256
Number of blocks in the grid: 512
Total number of threads     : 131072
Length of vectors           : 131072

GPU using registers: 9.6904191971, time consumed: 0.89290 ms
GPU using shared   : 9.6906833649, time consumed: 0.04289 ms
CPU result         : 9.6904191971, time consumed: 0.41527 ms
$ nvcc -std=c++11 t397.cu -o t397 -DUSE_DOUBLE
$ ./t397 17 512
Number of threads per block : 256
Number of blocks in the grid: 512
Total number of threads     : 131072
Length of vectors           : 131072

GPU using registers: 9.6913433287, time consumed: 1.33016 ms
GPU using shared   : 9.6913433287, time consumed: 0.05032 ms
CPU result         : 9.6913433287, time consumed: 0.41275 ms
$
