Problem description
I need to print a few values from a CUDA kernel, and have tried using cuPrintf. My compute capability is 1.1, and so I cannot use printf. The program compiles correctly and does not give any runtime error either. However, the cuPrintf lines seem to be doing nothing at all. Here are some of the things I tried:
- Compile with -arch sm_11
- Surround each kernel invocation with cudaPrintfInit and cudaPrintfEnd
- Ensure that the number of characters is small enough to work with the default buffer size
- Ensure that cudaPrintfInit and cudaPrintfDisplay return cudaSuccess
My program uses the following in addition to the regular stuff:
- The CUBLAS library
- Page-locked (pinned) + mapped memory
Why isn't the call to cuPrintf doing anything?
Edit
Here are some relevant snippets from the code:
__global__ void swap_rows(float *d_A, int r1, int r2, int n)
{
    int i = r1;
    int j = blockDim.x*blockIdx.x + threadIdx.x;
    cuPrintf("(%d,%d) ", i, j);
    if(j >= n) return;
    float tmp;
    tmp = d_A[L(i,j)];
    d_A[L(i,j)] = d_A[L(r2,j)];
    d_A[L(r2,j)] = tmp;
}
extern "C" float *someFunction(float *_A, float *_b, int n)
{
    int i, i_max, k, n2 = n*n;
    dim3 lblock_size(32,1);
    dim3 lgrid_size(n/lblock_size.x + 1, 1);
    float *d_A, *d_b, *d_x, *h_A, *h_b, *h_x, tmp, dotpdt;
    cublasStatus status;
    cudaError_t ret;
    if((ret = cudaSetDeviceFlags(cudaDeviceMapHost)) != cudaSuccess) {
        fprintf(stderr, "Error setting device flag: %s\n",
                cudaGetErrorString(ret));
        return NULL;
    }
    // Allocate mem for A and copy data
    if((ret = cudaHostAlloc((void **)&h_A, n2 * sizeof(float),
                            cudaHostAllocMapped)) != cudaSuccess) {
        fprintf(stderr, "Error allocating page-locked h_A: %s\n",
                cudaGetErrorString(ret));
        return NULL;
    }
    if((ret = cudaHostGetDevicePointer((void **)&d_A, h_A, 0)) != cudaSuccess) {
        fprintf(stderr, "Error getting devptr for page-locked h_A: %s\n",
                cudaGetErrorString(ret));
        return NULL;
    }
    if((ret = cudaMemcpy(h_A, _A, n2 * sizeof(float), cudaMemcpyHostToHost)) !=
       cudaSuccess) {
        fprintf(stderr, "Error copying A into h_A: %s\n", cudaGetErrorString(ret));
        return NULL;
    }
    // Some code to compute k and i_max
    if(cudaPrintfInit() != cudaSuccess)
        printf("cudaPrintfInit failed\n");
    swap_rows<<<lgrid_size,lblock_size>>>(d_A, k, i_max, n);
    if((ret = cudaThreadSynchronize()) != cudaSuccess)
        fprintf(stderr, "Synchronize failed: %s\n", cudaGetErrorString(ret));
    if(cudaPrintfDisplay(stdout, true) != cudaSuccess)
        printf("cudaPrintfDisplay failed\n");
    cudaPrintfEnd();
    // Some more code
}
I forgot to mention: these methods are compiled separately (from the main() function) as a dynamically linked module (shared object).
Recommended answer
Figured it out: another kernel in my program was failing with an "invalid configuration argument" error. I was launching that kernel with a block size of 32*32*1, i.e. 1024 threads, which exceeds the maximum number of threads permissible per block (512 on compute capability 1.x devices). As soon as this was fixed, the cuPrintf calls started working.
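For anyone hitting something similar, here is a minimal sketch of how a bad launch configuration can be caught at the launch site rather than showing up later as a silent failure. It is not the code from my program: the kernel scale and the helpers threadsInBlock and checkCuda are hypothetical names made up for illustration. It assumes a toolkit that provides cudaDeviceSynchronize (CUDA 4.0 or later); on older toolkits cudaThreadSynchronize plays the same role.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): total threads in a block.
static unsigned int threadsInBlock(dim3 block)
{
    return block.x * block.y * block.z;
}

// Hypothetical helper: print a labelled message if a CUDA call failed.
static bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Placeholder kernel standing in for the one launched with an oversized block.
__global__ void scale(float *data, int n)
{
    int threadsPerBlock = blockDim.x * blockDim.y;
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int idx = blockIdx.x * threadsPerBlock + tid;
    if (idx < n) data[idx] *= 2.0f;
}

int main()
{
    const int n = 4096;
    float *d_data = NULL;
    cudaDeviceProp prop;

    if (!checkCuda(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties"))
        return 1;
    if (!checkCuda(cudaMalloc((void **)&d_data, n * sizeof(float)), "cudaMalloc"))
        return 1;
    checkCuda(cudaMemset(d_data, 0, n * sizeof(float)), "cudaMemset");

    // Validate the block size against the device limit before launching.
    dim3 block(32, 32, 1);
    if (threadsInBlock(block) > (unsigned int)prop.maxThreadsPerBlock) {
        fprintf(stderr, "Block of %u threads exceeds the device limit of %d; "
                "falling back to 32x1x1\n",
                threadsInBlock(block), prop.maxThreadsPerBlock);
        block = dim3(32, 1, 1);
    }
    unsigned int perBlock = threadsInBlock(block);
    dim3 grid((n + perBlock - 1) / perBlock, 1, 1);

    scale<<<grid, block>>>(d_data, n);
    // Launch-configuration errors are reported here, not by the launch itself.
    checkCuda(cudaGetLastError(), "kernel launch");
    // Errors raised while the kernel runs are reported by the synchronize call.
    checkCuda(cudaDeviceSynchronize(), "kernel execution");

    cudaFree(d_data);
    return 0;
}
```

Calling cudaGetLastError right after every kernel launch would have pointed me at the offending kernel immediately, since a kernel launch has no return value to check.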