This article describes how to perform a cuBLAS matrix inversion from the device; it may be a useful reference for anyone hitting the same problem.

## Problem description

I am trying to run a matrix inversion from the device. This logic works fine if called from the host. The compilation line is as follows (Linux):

```
nvcc -ccbin g++ -arch=sm_35 -rdc=true simple-inv.cu -o simple-inv -lcublas_device -lcudadevrt
```

I get the following warning that I cannot seem to resolve. (My GPU is Kepler. I don't know why it is trying to link to Maxwell routines; I have CUDA 6.5-14):

```
nvlink warning : SM Arch ('sm_35') not found in '/usr/local/cuda/bin/../targets/x86_64-linux/lib/libcublas_device.a:maxwell_sm50_sgemm.o'
```

When the program runs:

```
handle 0 n = 3
simple-inv.cu:63 Error [an illegal memory access was encountered]
```

The test program is as follows:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define PERR(call) \
  if (call) {\
    fprintf(stderr, "%s:%d Error [%s] on "#call"\n", __FILE__, __LINE__,\
            cudaGetErrorString(cudaGetLastError()));\
    exit(1);\
  }
#define ERRCHECK \
  if (cudaPeekAtLastError()) { \
    fprintf(stderr, "%s:%d Error [%s]\n", __FILE__, __LINE__,\
            cudaGetErrorString(cudaGetLastError()));\
    exit(1);\
  }

__global__ void
inv_kernel(float *a_i, float *c_o, int n)
{
  int p[3], info[1], batch;
  cublasHandle_t hdl;
  cublasStatus_t status = cublasCreate_v2(&hdl);
  printf("handle %d n = %d\n", status, n);

  info[0] = 0;
  batch = 1;
  float *a[] = {a_i};
  const float *aconst[] = {a_i};
  float *c[] = {c_o};
  // See
  // http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
  // http://stackoverflow.com/questions/27094612/cublas-matrix-inversion-from-device
  status = cublasSgetrfBatched(hdl, n, a, n, p, info, batch);
  __syncthreads();
  printf("rf %d info %d\n", status, info[0]);
  status = cublasSgetriBatched(hdl, n, aconst, n, p, c, n, info, batch);
  __syncthreads();
  printf("ri %d info %d\n", status, info[0]);

  cublasDestroy_v2(hdl);
  printf("done\n");
}

static void
run_inv(float *in, float *out, int n)
{
  float *a_d, *c_d;

  PERR(cudaMalloc(&a_d, n*n*sizeof(float)));
  PERR(cudaMalloc(&c_d, n*n*sizeof(float)));
  PERR(cudaMemcpy(a_d, in, n*n*sizeof(float), cudaMemcpyHostToDevice));

  inv_kernel<<<1, 1>>>(a_d, c_d, n);

  cudaDeviceSynchronize();
  ERRCHECK;

  PERR(cudaMemcpy(out, c_d, n*n*sizeof(float), cudaMemcpyDeviceToHost));
  PERR(cudaFree(a_d));
  PERR(cudaFree(c_d));
}

int
main(int argc, char **argv)
{
  float c[9];
  float a[] = {
    1, 2, 3,
    0, 4, 5,
    1, 0, 6 };

  run_inv(a, c, 3);
  return 0;
}
```

I have followed the guide at http://docs.nvidia.com/cuda/cublas/index.html#device-api section 2.1.9, but I suspect I have overlooked something.

Note: Edited on 11/24 to use correct pointer inputs. This still reports an illegal memory access inside the kernel.

## Accepted answer

The warnings about sm_50 are benign. That's my way of saying "they can be safely ignored in this case".

Regarding the code you currently have posted, the problem relates to what is described in the dynamic parallelism documentation around the use of thread-local memory.

In a nutshell, the local memory of the parent thread is "out of scope" in a child kernel launch. Although it's not entirely obvious, the cuBLAS calls from device code are (attempting) to launch child kernels. This means that declarations like this:

```cuda
int p[3], info[1], batch;
```

will be problematic if those pointers (e.g. `p`, `info`) are passed to a child kernel.
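The same hazard can be shown in isolation, without cuBLAS. The following is a minimal sketch (not from the original answer; the `parent`/`child` kernel names are my own, and it requires the same `-arch=sm_35 -rdc=true -lcudadevrt` build flags as the code in the question):

```cuda
#include <cstdio>

// Child kernel: reads through a pointer supplied by the parent kernel.
__global__ void child(int *data)
{
    // If 'data' points into the parent's local (stack) memory, this read is
    // undefined behavior; if it points into the device heap, it is valid.
    printf("child sees %d\n", data[0]);
}

__global__ void parent()
{
    int local_val[1] = {42};                     // thread-local: out of scope in the child
    int *heap_val = (int *)malloc(sizeof(int));  // device heap: visible to child kernels
    heap_val[0] = 42;

    child<<<1, 1>>>(local_val);  // WRONG: passes an address in local memory
    child<<<1, 1>>>(heap_val);   // OK: device-heap allocations are shared with children
    cudaDeviceSynchronize();     // device-side sync, available in CUDA 6.5
    free(heap_val);
}

int main()
{
    parent<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

The first `child` launch is exactly the mistake the cuBLAS device calls run into when handed `p` and `info` from the parent thread's stack.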
The numerical values of the pointers themselves will not be corrupted, but they will not point to anything "meaningful" in the memory space of the child kernel.

There are multiple ways to solve this, but one possible solution is to replace any stack/local allocations of this type with allocations from the device heap, which can be made via in-kernel `malloc`.

Here is a fully worked code/example that seems to work correctly for me. The output seems to be correct for the inversion of the given sample matrix:

```cuda
// t605.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define PERR(call) \
  if (call) {\
    fprintf(stderr, "%s:%d Error [%s] on "#call"\n", __FILE__, __LINE__,\
            cudaGetErrorString(cudaGetLastError()));\
    exit(1);\
  }
#define ERRCHECK \
  if (cudaPeekAtLastError()) { \
    fprintf(stderr, "%s:%d Error [%s]\n", __FILE__, __LINE__,\
            cudaGetErrorString(cudaGetLastError()));\
    exit(1);\
  }

__global__ void
inv_kernel(float *a_i, float *c_o, int n)
{
  // Device-heap allocations instead of thread-local arrays, so that the
  // pointers remain valid inside the child kernels launched by cuBLAS.
  int *p = (int *)malloc(3*sizeof(int));
  int *info = (int *)malloc(sizeof(int));
  int batch;
  cublasHandle_t hdl;
  cublasStatus_t status = cublasCreate_v2(&hdl);
  printf("handle %d n = %d\n", status, n);

  info[0] = 0;
  batch = 1;
  float **a = (float **)malloc(sizeof(float *));
  *a = a_i;
  const float **aconst = (const float **)a;
  float **c = (float **)malloc(sizeof(float *));
  *c = c_o;
  // See
  // http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
  // http://stackoverflow.com/questions/27094612/cublas-matrix-inversion-from-device
  status = cublasSgetrfBatched(hdl, n, a, n, p, info, batch);
  __syncthreads();
  printf("rf %d info %d\n", status, info[0]);
  status = cublasSgetriBatched(hdl, n, aconst, n, p, c, n, info, batch);
  __syncthreads();
  printf("ri %d info %d\n", status, info[0]);

  cublasDestroy_v2(hdl);
  printf("done\n");
}

static void
run_inv(float *in, float *out, int n)
{
  float *a_d, *c_d;

  PERR(cudaMalloc(&a_d, n*n*sizeof(float)));
  PERR(cudaMalloc(&c_d, n*n*sizeof(float)));
  PERR(cudaMemcpy(a_d, in, n*n*sizeof(float), cudaMemcpyHostToDevice));

  inv_kernel<<<1, 1>>>(a_d, c_d, n);

  cudaDeviceSynchronize();
  ERRCHECK;

  PERR(cudaMemcpy(out, c_d, n*n*sizeof(float), cudaMemcpyDeviceToHost));
  PERR(cudaFree(a_d));
  PERR(cudaFree(c_d));
}

int
main(int argc, char **argv)
{
  float c[9];
  float a[] = {
    1, 2, 3,
    0, 4, 5,
    1, 0, 6 };

  run_inv(a, c, 3);
  for (int i = 0; i < 3; i++){
    for (int j = 0; j < 3; j++) printf("%f, ", c[(3*i)+j]);
    printf("\n");}
  return 0;
}
```

Build and run:

```
$ nvcc -arch=sm_35 -rdc=true -o t605 t605.cu -lcublas_device -lcudadevrt
nvlink warning : SM Arch ('sm_35') not found in '/shared/apps/cuda/CUDA-v6.5.14/bin/..//lib64/libcublas_device.a:maxwell_sgemm.asm.o'
nvlink warning : SM Arch ('sm_35') not found in '/shared/apps/cuda/CUDA-v6.5.14/bin/..//lib64/libcublas_device.a:maxwell_sm50_sgemm.o'
$ ./t605
handle 0 n = 3
rf 0 info 0
ri 0 info 0
done
1.090909, -0.545455, -0.090909,
0.227273, 0.136364, -0.227273,
-0.181818, 0.090909, 0.181818,
$
```
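As a quick sanity check (not part of the original answer) that the printed matrix really is the inverse of the sample input, one can multiply the two on the host and confirm the product is the identity. The helper name `max_identity_error` is my own:

```cpp
#include <cmath>

// A is the sample matrix from the question; AINV holds the values printed by
// t605 above (rounded to 6 decimal places by printf's "%f").
static const double A[3][3]    = {{1, 2, 3}, {0, 4, 5}, {1, 0, 6}};
static const double AINV[3][3] = {{ 1.090909, -0.545455, -0.090909},
                                  { 0.227273,  0.136364, -0.227273},
                                  {-0.181818,  0.090909,  0.181818}};

// Returns the largest element-wise deviation of A * AINV from the identity.
double max_identity_error(void)
{
    double worst = 0.0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double s = 0.0;
            for (int k = 0; k < 3; k++)
                s += A[i][k] * AINV[k][j];
            worst = std::fmax(worst, std::fabs(s - (i == j ? 1.0 : 0.0)));
        }
    return worst;
}
```

Because the printed values are rounded to six decimals, the deviation comes out on the order of 1e-6, comfortably inside any reasonable tolerance such as 1e-4.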