


I need to print a few values from a CUDA kernel, and have tried using cuPrintf. My compute capability is 1.1, and so I cannot use printf. The program compiles correctly and does not give any runtime error either. However, the cuPrintf lines seem to be doing nothing at all. Here are some of the things I tried:

  • Compile with -arch sm_11
  • Surround each kernel invocation with cudaPrintfInit and cudaPrintfEnd
  • Ensure that the number of characters is small enough to work with the default buffer size
  • Ensure that cudaPrintfInit and cudaPrintfDisplay return cudaSuccess


My program uses the following in addition to the regular stuff:


  • Page-locked (pinned) + mapped memory


Why isn't the call to cuPrintf doing anything?



Here are some relevant snippets from the code:

__global__ void swap_rows(float *d_A, int r1, int r2, int n)
{
  int i = r1;
  int j = blockDim.x*blockIdx.x + threadIdx.x;
  cuPrintf("(%d,%d) ", i, j);

  if(j >= n) return;
  float tmp;
  tmp = d_A[L(i,j)];
  d_A[L(i,j)] = d_A[L(r2,j)];
  d_A[L(r2,j)] = tmp;
}

extern "C" float *someFunction(float *_A, float *_b, int n)
{
  int i, i_max, k, n2 = n*n;
  dim3 lblock_size(32,1);
  dim3 lgrid_size(n/lblock_size.x + 1, 1);
  float *d_A, *d_b, *d_x, *h_A, *h_b, *h_x, tmp, dotpdt;

  cublasStatus status;
  cudaError_t ret;

  // Enable mapped (zero-copy) host memory before any allocation
  if((ret = cudaSetDeviceFlags(cudaDeviceMapHost)) != cudaSuccess) {
    fprintf(stderr, "Error setting device flag: %s\n",
            cudaGetErrorString(ret));
    return NULL;
  }

  // Allocate mem for A and copy data
  if((ret = cudaHostAlloc((void **)&h_A, n2 * sizeof(float),
                            cudaHostAllocMapped)) != cudaSuccess) {
    fprintf(stderr, "Error allocating page-locked h_A: %s\n",
            cudaGetErrorString(ret));
    return NULL;
  }

  if((ret = cudaHostGetDevicePointer((void **)&d_A, h_A, 0)) != cudaSuccess) {
    fprintf(stderr, "Error getting devptr for page-locked h_A: %s\n",
            cudaGetErrorString(ret));
    return NULL;
  }

  if((ret = cudaMemcpy(h_A, _A, n2 * sizeof(float), cudaMemcpyHostToHost)) !=
      cudaSuccess) {
    fprintf(stderr, "Error copying A into h_A: %s\n", cudaGetErrorString(ret));
    return NULL;
  }

  // Some code to compute k and i_max

  if(cudaPrintfInit() != cudaSuccess)
    printf("cudaPrintfInit failed\n");

  swap_rows<<<lgrid_size,lblock_size>>>(d_A, k, i_max, n);
  if((ret = cudaThreadSynchronize()) != cudaSuccess)
    fprintf(stderr, "Synchronize failed: %s\n", cudaGetErrorString(ret));

  if(cudaPrintfDisplay(stdout, true) != cudaSuccess)
    printf("cudaPrintfDisplay failed\n");

  // Some more code
}


I forgot to mention: these methods are compiled separately (from the main() function) as a dynamically linked module (shared object).


Figured it out: I have another kernel which gave an "invalid configuration argument" error. I was using a block size of 32*32*1 for that kernel, and this exceeds the maximum number of threads permissible per block. As soon as this was fixed, the cuPrintf's started working.
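
For anyone hitting the same thing, the failure comes down to simple arithmetic: a 32*32*1 block is 1024 threads, while devices of compute capability 1.x allow at most 512 threads per block. Below is a minimal host-side sketch of that check in plain C. The helper names and the hard-coded 512 limit are my own assumptions for illustration, not part of cuPrintf or the CUDA API; in real code the limit would more sensibly be read from the maxThreadsPerBlock field filled in by cudaGetDeviceProperties.

```c
/* Assumed limit for compute capability 1.x devices. Real code should query
   maxThreadsPerBlock via cudaGetDeviceProperties instead of hard-coding it. */
#define MAX_THREADS_PER_BLOCK_CC1X 512u

/* Hypothetical helper: total number of threads in a block of x * y * z. */
unsigned threads_per_block(unsigned x, unsigned y, unsigned z)
{
    return x * y * z;
}

/* Hypothetical helper: returns 1 if the block fits within the assumed
   per-block limit, 0 otherwise. */
int block_size_ok(unsigned x, unsigned y, unsigned z)
{
    return threads_per_block(x, y, z) <= MAX_THREADS_PER_BLOCK_CC1X;
}
```

With these, block_size_ok(32, 32, 1) is false (1024 threads), while the 32*1 block used for swap_rows and something like 16*16*1 both pass. Calling cudaGetLastError() right after every kernel launch would also have surfaced the "invalid configuration argument" error at the launch that caused it.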

