Problem description
I'm trying to compile a kernel that uses dynamic parallelism to call CUBLAS into a cubin file. When I try to compile the code using the command
nvcc -cubin -m64 -lcudadevrt -lcublas_device -gencode arch=compute_35,code=sm_35 -o test.cubin -c test.cu
I get the error: ptxas fatal : Unresolved extern function 'cublasCreate_v2'
If I add the -rdc=true compile option, it compiles fine, but when I try to load the module using cuModuleLoad I get error 500: CUDA_ERROR_NOT_FOUND. From cuda.h:
/**
* This indicates that a named symbol was not found. Examples of symbols
* are global/constant variable names, texture names, and surface names.
*/
CUDA_ERROR_NOT_FOUND = 500,
Kernel code:
#include <stdio.h>
#include <cublas_v2.h>
extern "C" {
__global__ void a() {
    cublasHandle_t cb_handle = NULL;
    cudaStream_t stream;
    if( threadIdx.x == 0 ) {
        cublasStatus_t status = cublasCreate_v2(&cb_handle);
        cublasSetPointerMode_v2(cb_handle, CUBLAS_POINTER_MODE_HOST);
        if (status != CUBLAS_STATUS_SUCCESS) {
            return;
        }
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
        cublasSetStream_v2(cb_handle, stream);
    }
    __syncthreads();
    int jp;
    double A[3];
    A[0] = 4.0f;
    A[1] = 5.0f;
    A[2] = 6.0f;
    cublasIdamax_v2(cb_handle, 3, A, 1, &jp );
}
}
NOTE: A is a local array, so the data behind the pointer passed to cublasIdamax_v2 is undefined, and jp ends up as a more or less random value in this code. The correct way to do it would be to have A in global memory.
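For reference, here is a minimal CPU sketch of what cublasIdamax_v2 should report once A is readable. It assumes the standard BLAS IDAMAX convention: the 1-based index of the first element with the largest absolute value. The function name ref_idamax is hypothetical and not part of cuBLAS; it only illustrates the expected result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference (not a cuBLAS function).
   Returns the 1-based index of the first element of x with the largest
   absolute value, stepping through x with stride incx.
   Returns 0 when n <= 0 or incx <= 0, following the BLAS convention. */
int ref_idamax(int n, const double *x, int incx) {
    if (n <= 0 || incx <= 0) return 0;
    assert(x != NULL);

    int best = 1;                                   /* 1-based index */
    double best_abs = x[0] < 0 ? -x[0] : x[0];

    for (int i = 1; i < n; ++i) {
        double v = x[(size_t)i * (size_t)incx];
        double a = v < 0 ? -v : v;
        if (a > best_abs) {                         /* strict: keep first max */
            best_abs = a;
            best = i + 1;
        }
    }
    return best;
}
```

With A = {4.0, 5.0, 6.0}, ref_idamax(3, A, 1) returns 3, so a correctly working kernel should leave jp equal to 3 rather than an arbitrary value.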
Host code:
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
int main() {
    CUresult error;
    CUdevice cuDevice;
    CUcontext cuContext;
    CUmodule cuModule;
    CUfunction testkernel;
    // Initialize
    error = cuInit(0);
    if (error != CUDA_SUCCESS) printf("ERROR: cuInit, %i\n", error);
    error = cuDeviceGet(&cuDevice, 0);
    if (error != CUDA_SUCCESS) printf("ERROR: cuDeviceGet, %i\n", error);
    error = cuCtxCreate(&cuContext, 0, cuDevice);
    if (error != CUDA_SUCCESS) printf("ERROR: cuCtxCreate, %i\n", error);
    error = cuModuleLoad(&cuModule, "test.cubin");
    if (error != CUDA_SUCCESS) printf("ERROR: cuModuleLoad, %i\n", error);
    error = cuModuleGetFunction(&testkernel, cuModule, "a");
    if (error != CUDA_SUCCESS) printf("ERROR: cuModuleGetFunction, %i\n", error);
    return 0;
}
The host code is compiled using nvcc -lcuda test.cpp. If I replace the kernel with a simple kernel (below) and compile it without -rdc=true, it works fine.
Simple working kernel:
#include <stdio.h>
extern "C" {
__global__ void a() {
    printf("hello\n");
}
}
Thanks in advance
- Soren
Recommended answer
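You are just missing -dlink in your first approach. Relocatable device code has to go through a device-link step, which resolves the calls into cublas_device and cudadevrt, before the cubin can be loaded: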
nvcc -cubin -m64 -lcudadevrt -lcublas_device -gencode arch=compute_35,code=sm_35 -o test.cubin -c test.cu -dlink
You can also do that in two steps:
nvcc -m64 test.cu -gencode arch=compute_35,code=sm_35 -o test.o -dc
nvcc -dlink test.o -arch sm_35 -lcublas_device -lcudadevrt -cubin -o test.cubin