Problem description
I'm having a problem passing an array of structs to a GPU kernel. I based my approach on this topic - , and then I wrote something like this:
#include <stdio.h>
#include <stdlib.h>
struct Test {
    char *array;
};
__global__ void kernel(Test *dev_test) {
    for(int i = 0; i < 5; i++) {
        printf("Kernel[0][i]: %c \n", dev_test[0].array[i]);
    }
}
int main(void) {
    int n = 4, size = 5;
    Test *dev_test, *test;
    test = (Test*)malloc(sizeof(Test)*n);
    for(int i = 0; i < n; i++)
        test[i].array = (char*)malloc(size * sizeof(char));
    for(int i = 0; i < n; i++) {
        char temp[] = { 'a', 'b', 'c', 'd', 'e' };
        memcpy(test[i].array, temp, size * sizeof(char));
    }
    cudaMalloc((void**)&dev_test, n * sizeof(Test));
    cudaMemcpy(dev_test, test, n * sizeof(Test), cudaMemcpyHostToDevice);
    for(int i = 0; i < n; i++) {
        cudaMalloc((void**)&(test[i].array), size * sizeof(char));
        cudaMemcpy(&(dev_test[i].array), &(test[i].array), size * sizeof(char), cudaMemcpyHostToDevice);
    }
    kernel<<<1, 1>>>(dev_test);
    cudaDeviceSynchronize();
    // free memory
    return 0;
}
There are no errors, but the values printed from the kernel are incorrect. What am I doing wrong? Thanks in advance for any help.
This is allocating a new pointer to host memory (step 1):

test[i].array = (char*)malloc(size * sizeof(char));

This is copying data to that region of host memory (step 2):

memcpy(test[i].array, temp, size * sizeof(char));

This is overwriting the previously allocated pointer to host memory (from step 1 above) with a new pointer to device memory (step 3):

cudaMalloc((void**)&(test[i].array), size * sizeof(char));

After step 3, the data you set up in step 2 is entirely lost and no longer accessible in any way. Referring to steps 3 and 4 in the question/answer you linked:
You haven't done this. You did not create a separate pointer. You reused (erased, overwrote) an existing pointer, which was pointing to data you cared about on the host. This question/answer, also linked from the answer you linked, gives almost exactly the steps you need to follow, in code.
Here's a modified version of your code that properly implements steps 3, 4, and 5 from the question/answer you linked (refer to the comments delineating steps 3, 4, and 5):
$ cat t755.cu
#include <stdio.h>
#include <stdlib.h>
struct Test {
char *array;
};
__global__ void kernel(Test *dev_test) {
for(int i=0; i < 5; i++) {
printf("Kernel[0][i]: %c \n", dev_test[0].array[i]);
}
}
int main(void) {
int n = 4, size = 5;
Test *dev_test, *test;
test = (Test*)malloc(sizeof(Test)*n);
for(int i = 0; i < n; i++)
test[i].array = (char*)malloc(size * sizeof(char));
for(int i=0; i < n; i++) {
char temp[] = { 'a', 'b', 'c', 'd', 'e' };
memcpy(test[i].array, temp, size * sizeof(char));
}
cudaMalloc((void**)&dev_test, n * sizeof(Test));
cudaMemcpy(dev_test, test, n * sizeof(Test), cudaMemcpyHostToDevice);
// Step 3:
char *temp_data[n];
// Step 4:
for (int i=0; i < n; i++)
cudaMalloc(&(temp_data[i]), size*sizeof(char));
// Step 5:
for (int i=0; i < n; i++)
cudaMemcpy(&(dev_test[i].array), &(temp_data[i]), sizeof(char *), cudaMemcpyHostToDevice);
// now copy the embedded data:
for (int i=0; i < n; i++)
cudaMemcpy(temp_data[i], test[i].array, size*sizeof(char), cudaMemcpyHostToDevice);
kernel<<<1, 1>>>(dev_test);
cudaDeviceSynchronize();
// memory free
return 0;
}
$ nvcc -o t755 t755.cu
$ cuda-memcheck ./t755
========= CUDA-MEMCHECK
Kernel[0][i]: a
Kernel[0][i]: b
Kernel[0][i]: c
Kernel[0][i]: d
Kernel[0][i]: e
========= ERROR SUMMARY: 0 errors
$
Since the above methodology can be challenging for beginners, the usual advice is not to do it, but instead flatten your data structures. Flatten generally means to rearrange the data storage so as to remove the embedded pointers that have to be separately allocated.
A trivial example of flattening this data structure would be to use this instead:
struct Test {
char array[5];
};
It's recognized of course that this particular approach would not serve every purpose, but it should illustrate the general idea/intent. With that modification, as an example, the code becomes much simpler:
$ cat t755.cu
#include <stdio.h>
#include <stdlib.h>
struct Test {
char array[5];
};
__global__ void kernel(Test *dev_test) {
for(int i=0; i < 5; i++) {
printf("Kernel[0][i]: %c \n", dev_test[0].array[i]);
}
}
int main(void) {
int n = 4, size = 5;
Test *dev_test, *test;
test = (Test*)malloc(sizeof(Test)*n);
for(int i=0; i < n; i++) {
char temp[] = { 'a', 'b', 'c', 'd' , 'e' };
memcpy(test[i].array, temp, size * sizeof(char));
}
cudaMalloc((void**)&dev_test, n * sizeof(Test));
cudaMemcpy(dev_test, test, n * sizeof(Test), cudaMemcpyHostToDevice);
kernel<<<1, 1>>>(dev_test);
cudaDeviceSynchronize();
// memory free
return 0;
}
$ nvcc -o t755 t755.cu
$ cuda-memcheck ./t755
========= CUDA-MEMCHECK
Kernel[0][i]: a
Kernel[0][i]: b
Kernel[0][i]: c
Kernel[0][i]: d
Kernel[0][i]: e
========= ERROR SUMMARY: 0 errors
$