cublasXt matrix multiply succeeds in C++, fails in Python


Question

I'm trying to wrap the cublasXt*gemm functions in CUDA 9.0 with ctypes in Python 2.7.14 on Ubuntu Linux 16.04. These functions accept arrays in host memory as some of their arguments. I have been able to use them successfully in C++ as follows:

#include <iostream>
#include <cstdlib>
#include "cublasXt.h"
#include "cuda_runtime_api.h"

void rand_mat(float* &x, int m, int n) {
    x = new float[m*n];
    for (int i=0; i<m; ++i) {
        for (int j=0; j<n; ++j) {
            x[i*n+j] = ((float)rand())/RAND_MAX;
        }
    }
}

int main(void) {
    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[1] = {0};
    if (cublasXtDeviceSelect(handle, 1, devices) !=
        CUBLAS_STATUS_SUCCESS) {
        std::cout << "initialization failed" << std::endl;
        return 1;
    }

    float *a, *b, *c;
    int m = 4, n = 4, k = 4;

    rand_mat(a, m, k);
    rand_mat(b, k, n);
    rand_mat(c, m, n);

    float alpha = 1.0;
    float beta = 0.0;

    if (cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                      m, n, k, &alpha, a, m, b, k, &beta, c, m) !=
           CUBLAS_STATUS_SUCCESS) {
        std::cout << "matrix multiply failed" << std::endl;
        return 1;
    }
    delete[] a; delete[] b; delete[] c;  // memory from new[] must be released with delete[]
    cublasXtDestroy(handle);
}

However, when I try to wrap them in Python as follows, I encounter a segfault at the cublasXt*gemm call:

import ctypes
import numpy as np

_libcublas = ctypes.cdll.LoadLibrary('libcublas.so')
_libcublas.cublasXtCreate.restype = int
_libcublas.cublasXtCreate.argtypes = [ctypes.c_void_p]
_libcublas.cublasXtDestroy.restype = int
_libcublas.cublasXtDestroy.argtypes = [ctypes.c_void_p]
_libcublas.cublasXtDeviceSelect.restype = int
_libcublas.cublasXtDeviceSelect.argtypes = [ctypes.c_void_p,
                                            ctypes.c_int,
                                            ctypes.c_void_p]
_libcublas.cublasXtSgemm.restype = int
_libcublas.cublasXtSgemm.argtypes = [ctypes.c_void_p,
                                     ctypes.c_int,
                                     ctypes.c_int,
                                     ctypes.c_int,
                                     ctypes.c_int,
                                     ctypes.c_int,
                                     ctypes.c_void_p,
                                     ctypes.c_void_p,
                                     ctypes.c_int,
                                     ctypes.c_void_p,
                                     ctypes.c_int,
                                     ctypes.c_void_p,
                                     ctypes.c_void_p,
                                     ctypes.c_int]

handle = ctypes.c_void_p()
_libcublas.cublasXtCreate(ctypes.byref(handle))
deviceId = np.array([0], np.int32)
status = _libcublas.cublasXtDeviceSelect(handle, 1,
                                         deviceId.ctypes.data)
if status:
    raise RuntimeError

a = np.random.rand(4, 4).astype(np.float32)
b = np.random.rand(4, 4).astype(np.float32)
c = np.zeros((4, 4), np.float32)

status = _libcublas.cublasXtSgemm(handle, 0, 0, 4, 4, 4,
                                  ctypes.byref(ctypes.c_float(1.0)),
                                  a.ctypes.data, 4, b.ctypes.data, 4,
                                  ctypes.byref(ctypes.c_float(0.0)),
                                  c.ctypes.data, 4)
if status:
    raise RuntimeError
print 'success? ', np.allclose(np.dot(a.T, b.T).T, c_gpu.get())
_libcublas.cublasXtDestroy(handle)

Curiously, the Python wrappers above work if I slightly modify them to accept pycuda.gpuarray.GPUArray matrices that I have transferred to the GPU. Any thoughts as to why I encounter a segfault only in Python when passing host memory to the function?

Recommended answer

There appear to be errors in the CUBLAS documentation for these Xt<t>gemm functions. Starting at least with CUDA 8, the parameters m, n, k, lda, ldb, ldc are all of type size_t. This can be discovered by looking at the header file cublasXt.h.
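To see why the declared type matters, it can help to compare how wide the two ctypes types are. The sketch below only inspects type sizes and never calls cuBLAS; the byte counts mentioned in the comments are what one would typically expect on 64-bit Linux and are an assumption about the platform, not a guarantee.

```python
from __future__ import print_function  # keeps the sketch valid on Python 2.7 and 3
import ctypes

int_size = ctypes.sizeof(ctypes.c_int)
size_t_size = ctypes.sizeof(ctypes.c_size_t)

# On a typical 64-bit Linux build these are 4 and 8 bytes respectively.
print('c_int bytes:   ', int_size)
print('c_size_t bytes:', size_t_size)

# size_t is normally as wide as a pointer on mainstream platforms.
print('size_t is pointer-sized:',
      size_t_size == ctypes.sizeof(ctypes.c_void_p))
```

If the two sizes differ, a function compiled to read size_t arguments receives values that ctypes prepared as plain ints, so dimensions and leading dimensions are not guaranteed to arrive as intended. That is consistent with the segfault going away once argtypes matches the header.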

The following modification of your wrapper seems to work correctly for me:

$ cat t1340.py
import ctypes
import numpy as np

_libcublas = ctypes.cdll.LoadLibrary('libcublas.so')
_libcublas.cublasXtCreate.restype = int
_libcublas.cublasXtCreate.argtypes = [ctypes.c_void_p]
_libcublas.cublasXtDestroy.restype = int
_libcublas.cublasXtDestroy.argtypes = [ctypes.c_void_p]
_libcublas.cublasXtDeviceSelect.restype = int
_libcublas.cublasXtDeviceSelect.argtypes = [ctypes.c_void_p,
                                            ctypes.c_int,
                                            ctypes.c_void_p]
_libcublas.cublasXtSgemm.restype = int
_libcublas.cublasXtSgemm.argtypes = [ctypes.c_void_p,
                                     ctypes.c_int,
                                     ctypes.c_int,
                                     ctypes.c_size_t,
                                     ctypes.c_size_t,
                                     ctypes.c_size_t,
                                     ctypes.c_void_p,
                                     ctypes.c_void_p,
                                     ctypes.c_size_t,
                                     ctypes.c_void_p,
                                     ctypes.c_size_t,
                                     ctypes.c_void_p,
                                     ctypes.c_void_p,
                                     ctypes.c_size_t]

handle = ctypes.c_void_p()
_libcublas.cublasXtCreate(ctypes.byref(handle))
deviceId = np.array([0], np.int32)
status = _libcublas.cublasXtDeviceSelect(handle, 1,
                                         deviceId.ctypes.data)
if status:
    raise RuntimeError

a = np.random.rand(4, 4).astype(np.float32)
b = np.random.rand(4, 4).astype(np.float32)
c = np.zeros((4, 4), np.float32)
alpha = ctypes.c_float(1.0)
beta = ctypes.c_float(0.0)

status = _libcublas.cublasXtSgemm(handle, 0, 0, 4, 4, 4,
                                 ctypes.byref(alpha),
                                 a.ctypes.data, 4, b.ctypes.data, 4,
                                 ctypes.byref(beta),
                                 c.ctypes.data, 4)
if status:
    raise RuntimeError
print 'success? ', np.allclose(np.dot(a.T, b.T).T, c)
_libcublas.cublasXtDestroy(handle)
$ python t1340.py
success?  True
$

Enumerating the changes I made:

1. changed argtypes for the m, n, k, lda, ldb, ldc parameters of cublasXtSgemm from c_int to c_size_t
2. provided explicit variables for your alpha and beta arguments; this is probably irrelevant
3. in your np.allclose function, changed c_gpu.get() to just c
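The reference expression np.dot(a.T, b.T).T used in the check can look odd at first. cuBLAS treats its buffers as column-major, while NumPy arrays created as above are row-major, so each buffer is effectively seen as its transpose. The following is a minimal NumPy-only sketch of that reasoning; colmajor_gemm is a hypothetical helper written for illustration, not part of cuBLAS, and it ignores alpha, beta and the transpose flags.

```python
from __future__ import print_function
import numpy as np

def colmajor_gemm(a_buf, b_buf, m, n, k, lda, ldb, ldc):
    """Reference C = A * B over flat buffers, indexing them column-major."""
    c_buf = np.zeros(ldc * n, dtype=a_buf.dtype)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                # Column-major: element (row, col) lives at row + col * ld.
                acc += a_buf[i + p * lda] * b_buf[p + j * ldb]
            c_buf[i + j * ldc] = acc
    return c_buf

rng = np.random.RandomState(0)
a = rng.rand(4, 4).astype(np.float32)
b = rng.rand(4, 4).astype(np.float32)

# ravel() exposes the row-major element order, i.e. what a.ctypes.data points at.
c = colmajor_gemm(a.ravel(), b.ravel(), 4, 4, 4, 4, 4, 4).reshape(4, 4)

print(np.allclose(np.dot(a.T, b.T).T, c))  # the check used in the wrapper
print(np.allclose(np.dot(b, a), c))        # the same product, written directly
```

Read this way, the call computes b times a in NumPy's row-major view, which is why the check compares against np.dot(a.T, b.T).T and not np.dot(a, b).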

The above was tested on CUDA 8 and CUDA 9. I have filed an internal bug with NVIDIA to have the docs updated (even the current CUDA 9 docs do not reflect the current state of the header files).
