Problem Description
So I want to allocate 2D arrays and also copy them between the CPU and GPU in CUDA, but I am a total beginner and other online materials are very difficult for me to understand or are incomplete. It is important that I am able to access them as a 2D array in the kernel code as shown below.
Note that height != width for the arrays; if that's even possible, it confuses me further, as I always struggle with choosing grid sizes.
I've considered flattening them, but I really want to get it working this way.
This is how far I've got by my own research.
__global__ void myKernel(int *firstArray, int *secondArray, int rows, int columns) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int column = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= rows || column >= columns)
        return;

    // Do something with the arrays like you would on a CPU, like:
    firstArray[row][column] = row * 2;
    secondArray[row][column] = row * 3;
}

int main() {
    int rows = 300, columns = 200;
    int h_firstArray[rows][columns], h_secondArray[rows][columns];
    int *d_firstArray[rows][columns], *d_secondArray[rows][columns];

    // populate h_ arrays (Can do this bit myself)

    // Allocate memory on device, no idea how to do for 2D arrays.
    // Do memcopies to GPU, no idea how to do for 2D arrays.

    dim3 block(rows,columns);
    dim3 grid(1,1);
    myKernel<<<grid,block>>>(d_firstArray, d_secondArray, rows, columns);

    // Do memcopies back to host, no idea how to do for 2D arrays.

    cudaFree(d_firstArray);
    cudaFree(d_secondArray);
    return 0;
}
I was asked whether the array width would be known at compile time in the problems I'm trying to solve. You can assume it is, as I'm primarily interested in that particular situation for now.
Recommended Answer
In the general case (array dimensions not known until runtime), handling doubly-subscripted access in CUDA device code requires an array of pointers, just as it does in host code. C and C++ handle each subscript as a pointer dereference, in order to reach the final location in the "2D array".
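As a minimal host-only illustration of that mechanism (my own sketch, not code from the question or the answer), a runtime-sized 2D array built as an array of row pointers looks like this, with each subscript resolving to a pointer dereference:

#include <stdlib.h>

int main() {
    int rows = 300, columns = 200;                    // only known at runtime in the general case

    int **a = (int **)malloc(rows * sizeof(int *));   // one pointer per row
    for (int r = 0; r < rows; r++)
        a[r] = (int *)malloc(columns * sizeof(int));  // each row is its own allocation

    a[2][3] = 42;   // a[2] dereferences the row-pointer array, [3] dereferences into that row

    for (int r = 0; r < rows; r++) free(a[r]);
    free(a);
    return 0;
}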
Double-pointer/doubly-subscripted access in device code in the general case is already covered in the canonical answer linked from the cuda tag info page. There are several drawbacks to this, which are covered in that answer so I won't repeat them here.
However, if the array width is known at compile time (array height can be dynamic - i.e. determined at runtime), then we can leverage the compiler and the language typing mechanisms to allow us to circumvent most of the drawbacks. Your code demonstrates several other incorrect patterns for CUDA and/or C/C++ usage:
- Passing an item for doubly-subscripted access to a C or C++ function cannot be done with a simple single pointer type like int *firstarray.
- Allocating large host arrays via stack-based mechanisms:
int h_firstArray[rows][columns], h_secondArray[rows][columns];
is often problematic in C and C++. These are stack-based variables and will often run into stack limits if large enough.
- CUDA threadblocks are limited to 1024 threads total. Therefore such a threadblock dimension:
dim3 block(rows,columns);
will not work except for very small sizes of rows and columns (the product must be less than or equal to 1024); see the sizing sketch after this list.
- When declaring pointer variables for a device array in CUDA, it is almost never correct to create arrays of pointers:
int *d_firstArray[rows][columns], *d_secondArray[rows][columns];
nor do we allocate space on the host, then "reallocate" those pointers for device usage.
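To make the third point concrete, here is a minimal sketch (my illustration, not part of the original answer) of a launch configuration that covers a rows x columns domain while staying within the 1024-threads-per-block limit; the full worked example below uses the same pattern:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int rows = 300, columns = 200;

    dim3 block(32, 32);                               // 32 * 32 = 1024 threads, the per-block maximum
    dim3 grid((columns + block.x - 1) / block.x,      // ceil(columns / block.x) blocks in x
              (rows    + block.y - 1) / block.y);     // ceil(rows    / block.y) blocks in y

    printf("launch: %u x %u blocks of %u x %u threads\n", grid.x, grid.y, block.x, block.y);
    // Threads that fall outside the array are discarded inside the kernel with a bounds check
    // such as: if (row >= rows || column >= columns) return;
    return 0;
}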
What follows is a worked example with the above items addressed, demonstrating the aforementioned method where the array width is known at compile time:
$ cat t50.cu
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const int array_width = 200;

typedef int my_arr[array_width];

__global__ void myKernel(my_arr *firstArray, my_arr *secondArray, int rows, int columns) {
  int column = blockIdx.x * blockDim.x + threadIdx.x;
  int row    = blockIdx.y * blockDim.y + threadIdx.y;
  if (row >= rows || column >= columns)
    return;

  // Do something with the arrays like you would on a CPU, like:
  firstArray[row][column]  = row * 2;
  secondArray[row][column] = row * 3;
}

int main() {
  int rows = 300, columns = array_width;
  my_arr *h_firstArray, *h_secondArray;
  my_arr *d_firstArray, *d_secondArray;
  size_t dsize = rows * columns * sizeof(int);
  h_firstArray = (my_arr *)malloc(dsize);
  h_secondArray = (my_arr *)malloc(dsize);

  // populate h_ arrays
  memset(h_firstArray, 0, dsize);
  memset(h_secondArray, 0, dsize);

  // Allocate memory on device
  cudaMalloc(&d_firstArray, dsize);
  cudaMalloc(&d_secondArray, dsize);

  // Do memcopies to GPU
  cudaMemcpy(d_firstArray, h_firstArray, dsize, cudaMemcpyHostToDevice);
  cudaMemcpy(d_secondArray, h_secondArray, dsize, cudaMemcpyHostToDevice);

  dim3 block(32, 32);
  dim3 grid((columns + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
  myKernel<<<grid, block>>>(d_firstArray, d_secondArray, rows, columns);

  // Do memcopies back to host
  cudaMemcpy(h_firstArray, d_firstArray, dsize, cudaMemcpyDeviceToHost);
  cudaMemcpy(h_secondArray, d_secondArray, dsize, cudaMemcpyDeviceToHost);

  // validate
  if (cudaGetLastError() != cudaSuccess) { printf("cuda error\n"); return -1; }
  for (int i = 0; i < rows; i++)
    for (int j = 0; j < columns; j++) {
      if (h_firstArray[i][j]  != i * 2) { printf("first mismatch at %d,%d, was: %d, should be: %d\n",  i, j, h_firstArray[i][j],  i * 2); return -1; }
      if (h_secondArray[i][j] != i * 3) { printf("second mismatch at %d,%d, was: %d, should be: %d\n", i, j, h_secondArray[i][j], i * 3); return -1; }
    }
  printf("success!\n");

  cudaFree(d_firstArray);
  cudaFree(d_secondArray);
  return 0;
}
$ nvcc -arch=sm_61 -o t50 t50.cu
$ cuda-memcheck ./t50
========= CUDA-MEMCHECK
success!
========= ERROR SUMMARY: 0 errors
$
I've reversed the sense of your kernel indexing (x,y) to help with coalesced global memory access. We see that with this kind of type creation, we can leverage the compiler and the language features to end up with code that allows for doubly-subscripted access in both host and device code, while otherwise allowing CUDA operations (e.g. cudaMemcpy) as if we are dealing with single-pointer (e.g. "flattened") arrays.
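As a quick host-side check of that last point (a minimal sketch of my own, reusing array_width and my_arr from the example above), indexing through the pointer-to-array type lands on exactly the same addresses as a flat 1D view of the buffer, which is why a single cudaMemcpy of dsize bytes covers the whole 2D array:

#include <assert.h>
#include <stdlib.h>

const int array_width = 200;        // same compile-time width as in the example above
typedef int my_arr[array_width];

int main() {
    int rows = 300;
    my_arr *p = (my_arr *)malloc(rows * sizeof(my_arr));  // one contiguous rows*array_width block

    // p[r][c] is *(*(p + r) + c); since *(p + r) is an int[array_width],
    // this is the same address as a flat offset of r*array_width + c ints.
    assert(&p[3][7] == (int *)p + 3 * array_width + 7);

    free(p);
    return 0;
}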