This article looks at the amount of local memory available to each CUDA thread. The question and answer below may be a useful reference for anyone dealing with the same problem.

Problem description

I read in the NVIDIA documentation (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications, table #12) that the amount of local memory per thread is 512 KB for my GPU (GTX 580, compute capability 2.0).

I tried, unsuccessfully, to check this limit on Linux with CUDA 6.5.

Here is the code I used (its only purpose is to test the local memory limit; it doesn't do any useful computation):

#include <iostream>
#include <stdio.h>

#define MEMSIZE 65000  // 65000 -> out of memory, 60000 -> ok

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=false)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d
", cudaGetErrorString(code), file, line);
        if( abort )
            exit(code);
    }
}

inline void gpuCheckKernelExecutionError( const char *file, int line)
{
    gpuAssert( cudaPeekAtLastError(), file, line);
    gpuAssert( cudaDeviceSynchronize(), file, line);
}


__global__ void kernel_test_private(char *output)
{
    int c = blockIdx.x*blockDim.x + threadIdx.x; // absolute col
    int r = blockIdx.y*blockDim.y + threadIdx.y; // absolute row

    char tmp[MEMSIZE];
    for( int i = 0; i < MEMSIZE; i++)
        tmp[i] = 4*r + c; // dummy computation in local mem
    for( int i = 0; i < MEMSIZE; i++)
        output[i] = tmp[i];
}

int main( void)
{
    printf( "MEMSIZE=%d bytes.
", MEMSIZE);

    // allocate memory
    char output[MEMSIZE];
    char *gpuOutput;
    cudaMalloc( (void**) &gpuOutput, MEMSIZE);

    // run kernel
    dim3 dimBlock( 1, 1);
    dim3 dimGrid( 1, 1);
    kernel_test_private<<<dimGrid, dimBlock>>>(gpuOutput);
    gpuCheckKernelExecutionError( __FILE__, __LINE__);

    // transfer data from GPU memory to CPU memory
    cudaMemcpy( output, gpuOutput, MEMSIZE, cudaMemcpyDeviceToHost);

    // release resources
    cudaFree(gpuOutput);
    cudaDeviceReset();

    return 0;
}

And the compile command line:

nvcc -o cuda_test_private_memory -Xptxas -v -O2 --compiler-options -Wall cuda_test_private_memory.cu

The compilation is fine and reports:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z19kernel_test_privatePc' for 'sm_20'
ptxas info    : Function properties for _Z19kernel_test_privatePc
    65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 21 registers, 40 bytes cmem[0]

I got an "out of memory" error at runtime on the GTX 580 when I reached 65000 bytes per thread. Here is the exact output of the program in the console:

MEMSIZE=65000 bytes.
GPUassert: out of memory cuda_test_private_memory.cu 48

I also ran a test with a GTX 770 GPU (on Linux with CUDA 6.5). It ran without error for MEMSIZE=200000, but the "out of memory" error occurred at runtime for MEMSIZE=250000.

How can this behavior be explained? Am I doing something wrong?

Recommended answer

It seems you are running into not a local memory limitation but a stack size limitation:

ptxas info    : Function properties for _Z19kernel_test_privatePc
    65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

The variable that you had intended to be local is, in this case, on the (GPU thread) stack.
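
As a quick cross-check, the same figure can be queried from the host at run time. The fragment below is only a sketch (it is not part of the original question) and assumes it is placed inside main() of the program above, in the same file as the kernel; cudaFuncGetAttributes reports, via its localSizeBytes field, the per-thread local memory that backs the stack frame, and should roughly match the 65000 bytes printed by ptxas.

    // Sketch only (not in the original program): ask the runtime how much
    // per-thread local memory the compiled kernel needs for its stack frame.
    cudaFuncAttributes attr;
    cudaError_t fErr = cudaFuncGetAttributes(&attr, kernel_test_private);
    if (fErr == cudaSuccess)
        printf("kernel_test_private uses %zu bytes of local memory per thread\n",
               (size_t) attr.localSizeBytes);
    else
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(fErr));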

Based on the information provided by @njuffa here, the available stack size limit is the lesser of:

  1. The maximum local memory size (512 KB for cc 2.x and higher)
  2. GPU memory / (number of SMs) / (maximum resident threads per SM)

Clearly, the first limit is not the issue. I assume you have a "standard" GTX 580, which has 1.5 GB of memory and 16 SMs. A cc 2.x device allows at most 1536 resident threads per multiprocessor. That gives 1536 MB / 16 / 1536 = 1 MB / 16 = 65536 bytes of stack per thread. Some overhead and other memory usage subtract from the total available memory, so the actual stack size limit is somewhat below 65536 bytes, apparently between 60000 and 65000 in your case.
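
For reference, the same arithmetic can be carried out automatically from the device properties. The following standalone program is only an illustration (not part of the original answer): it computes the rule-2 bound "GPU memory / number of SMs / max resident threads per SM" for device 0; the real limit will be somewhat lower because of the overhead just mentioned.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Upper bound on the per-thread stack: total memory / SM count / max
    // resident threads per SM (rule 2 above).
    size_t bound = prop.totalGlobalMem
                 / prop.multiProcessorCount
                 / prop.maxThreadsPerMultiProcessor;

    printf("%s: %zu bytes of memory, %d SMs, %d threads/SM -> at most ~%zu bytes of stack per thread\n",
           prop.name, prop.totalGlobalMem, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, bound);
    return 0;
}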

I suspect a similar calculation on your GTX 770 would yield a similar result, i.e. a maximum stack size between 200000 and 250000 bytes. For example, assuming the 4 GB variant of the card with 8 SMs and 2048 resident threads per SM, the same formula gives 4 GiB / 8 / 2048 = 262144 bytes, just above the observed range.
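
The per-thread stack size is also exposed through the runtime API. The small program below is again my own sketch, not something from the original answer: it reads the current reservation with cudaDeviceGetLimit and requests a larger one with cudaDeviceSetLimit. The launch in the question evidently got a big enough stack without such a call, so the driver appears to size the stack from the ptxas-reported frame; the explicit API is mainly useful when the stack need is not known at compile time (e.g. recursion). Whatever is requested still has to fit within the bound discussed above.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stackSize = 0;

    // Current per-thread stack reservation (a small default on most setups).
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("per-thread stack size: %zu bytes\n", stackSize);

    // Request a larger per-thread stack; if the device cannot back it with
    // enough memory, this call or a later kernel launch will fail.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("per-thread stack size now: %zu bytes\n", stackSize);
    return 0;
}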

This concludes this article on the amount of local memory per CUDA thread. We hope the recommended answer is helpful.
