Question

I found an answer here, but it is not clear whether I should reshape the array. Do I need to reshape the 2D array into 1D before passing it to a PyCUDA kernel?
Accepted answer
There is no need to reshape a 2D gpuarray in order to pass it to a CUDA kernel.
As I said in the answer you linked to, a 2D numpy or PyCUDA array is just an allocation of pitched linear memory, stored in row-major order by default. Both have two members which tell you everything you need in order to access the array: shape and strides. For example:
In [8]: X = np.arange(0, 15, dtype=np.int32).reshape((5, 3))

In [9]: print(X.shape)
(5, 3)

In [10]: print(X.strides)
(12, 4)
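To make the connection between strides and memory layout concrete, here is a small illustrative check (the names are my own, not part of the original answer): for a C-contiguous array, element X[i, j] lives at byte offset i*strides[0] + j*strides[1] from the start of the buffer.

import numpy as np

X = np.arange(0, 15, dtype=np.int32).reshape((5, 3))

# Byte offset of X[i, j] in the underlying buffer of a C-contiguous array.
i, j = 3, 2
offset = i * X.strides[0] + j * X.strides[1]

# Read the same element straight out of the flattened view of the buffer
# (reshape(-1) is a no-copy view for a contiguous array).
flat_view = X.reshape(-1)
assert flat_view[offset // X.itemsize] == X[i, j]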
The shape is self-explanatory, and the stride is the pitch of the storage in bytes. The best practice for kernel code is to treat the pointer supplied by PyCUDA as if it had been allocated with cudaMallocPitch, and to treat the first element of strides as the byte pitch of the rows in memory. A trivial example might look like this:
import pycuda.driver as drv
from pycuda.compiler import SourceModule
import pycuda.autoinit
import numpy as np

mod = SourceModule("""
__global__ void diag_kernel(float *dest, int stride, int N)
{
    const int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid < N) {
        // Step to row tid using the pitch in bytes, then index column tid
        float* p = (float*)((char*)dest + tid*stride) + tid;
        *p = 1.0f;
    }
}
""")

diag_kernel = mod.get_function("diag_kernel")

a = np.zeros((10,10), dtype=np.float32)
a_N = np.int32(a.shape[0])          # number of rows
a_stride = np.int32(a.strides[0])   # row pitch in bytes
a_bytes = a.size * a.dtype.itemsize
a_gpu = drv.mem_alloc(a_bytes)
drv.memcpy_htod(a_gpu, a)           # copy the 2D array as-is, no reshaping
diag_kernel(a_gpu, a_stride, a_N, block=(32,1,1))
drv.memcpy_dtoh(a, a_gpu)
print(a)
Here some memory is allocated on the device, the zeroed 2D array is copied to that allocation directly, and the result of the kernel (filling the diagonal with ones) is copied back to the host and printed. It is not necessary to flatten or otherwise modify the shape or memory layout of the 2D numpy data at any point in the process. The result is:
$ cuda-memcheck python ./gpuarray.py
========= CUDA-MEMCHECK
[[ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
========= ERROR SUMMARY: 0 errors
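Since the question was about gpuarray specifically, the same kernel can also be driven from a pycuda.gpuarray.GPUArray without any reshaping. The following is a minimal sketch, assuming the diag_kernel compiled above is still in scope and using my own variable names; a GPUArray exposes the same shape and strides members as a numpy array, and its device pointer is available as .gpudata.

import pycuda.gpuarray as gpuarray

# Allocate a zeroed 2D array directly on the device; shape and strides
# mirror those of the equivalent numpy array.
b_gpu = gpuarray.zeros((10, 10), dtype=np.float32)

# Pass the device pointer, the row pitch in bytes, and the row count.
diag_kernel(b_gpu.gpudata,
            np.int32(b_gpu.strides[0]),
            np.int32(b_gpu.shape[0]),
            block=(32, 1, 1))

print(b_gpu.get())   # copies the result back to a (10, 10) numpy array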