cuPrintf does nothing (program uses pinned + mapped memory, and CUBLAS too)

Problem description

I need to print a few values from a CUDA kernel, and have tried using cuPrintf. My compute capability is 1.1, and so I cannot use printf. The program compiles correctly and does not give any runtime error either. However, the cuPrintf lines seem to be doing nothing at all. Here are some of the things I tried:

Compile with -arch sm_11
Surround each kernel invocation with cudaPrintfInit and cudaPrintfEnd
Ensure that the number of characters is small enough to work with the default buffer size
Ensure that cudaPrintfInit and cudaPrintfDisplay return cudaSuccess

My program uses the following in addition to the regular stuff:

The CUBLAS library
Page-locked (pinned) + mapped memory

Why isn't the call to cuPrintf doing anything?

Edit
Here are some relevant snippets from the code:

__global__ void swap_rows(float *d_A, int r1, int r2, int n)
{
  int i = r1;
  int j = blockDim.x*blockIdx.x + threadIdx.x;
  cuPrintf("(%d,%d) ", i, j);

  if(j >= n) return;
  float tmp;
  tmp = d_A[L(i,j)];
  d_A[L(i,j)] = d_A[L(r2,j)];
  d_A[L(r2,j)] = tmp;
}

extern "C" float *someFunction(float *_A, float *_b, int n)
{
  int i, i_max, k, n2 = n*n;
  dim3 lblock_size(32,1);
  dim3 lgrid_size(n/lblock_size.x + 1, 1);
  float *d_A, *d_b, *d_x, *h_A, *h_b, *h_x, tmp, dotpdt;

  cublasStatus status;
  cudaError_t ret;

  if((ret = cudaSetDeviceFlags(cudaDeviceMapHost)) != cudaSuccess) {
    fprintf(stderr, "Error setting device flag: %s\n",
            cudaGetErrorString(ret));
    return NULL;
  }

  // Allocate mem for A and copy data
  if((ret = cudaHostAlloc((void **)&h_A, n2 * sizeof(float),
                            cudaHostAllocMapped)) != cudaSuccess) {
    fprintf(stderr, "Error allocating page-locked h_A: %s\n",
            cudaGetErrorString(ret));
    return NULL;
  }

  if((ret = cudaHostGetDevicePointer((void **)&d_A, h_A, 0)) != cudaSuccess) {
    fprintf(stderr, "Error getting devptr for page-locked h_A: %s\n",
            cudaGetErrorString(ret));
    return NULL;
  }

  if((ret = cudaMemcpy(h_A, _A, n2 * sizeof(float), cudaMemcpyHostToHost)) !=
      cudaSuccess) {
    fprintf(stderr, "Error copying A into h_A: %s\n", cudaGetErrorString(ret));
    return NULL;
  }

  // Some code to compute k and i_max

  if(cudaPrintfInit() != cudaSuccess)
    printf("cudaPrintfInit failed\n");

  swap_rows<<<lgrid_size,lblock_size>>>(d_A, k, i_max, n);
  if((ret = cudaThreadSynchronize()) != cudaSuccess)
    fprintf(stderr, "Synchronize failed: %s\n", cudaGetErrorString(ret));

  if(cudaPrintfDisplay(stdout, true) != cudaSuccess)
    printf("cudaPrintfDisplay failed\n");
  cudaPrintfEnd();

// Some more code
}

I forgot to mention: these methods are compiled separately (from the main() function) as a dynamically linked module (shared object).

Recommended answer