Problem description
While programming with CUDA, I am facing a problem trying to copy some data from the host to the GPU.
I have three nested structs like these:
typedef struct {
    char data[128];
    short length;
} Cell;

typedef struct {
    Cell* elements;
    int height;
    int width;
} Matrix;

typedef struct {
    Matrix* tables;
    int count;
} Container;
So a Container "includes" some Matrix elements, each of which in turn includes some Cell elements.
Let's suppose I dynamically allocate the host memory in this way:
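Container c;
c.count = 20;
c.tables = (Matrix*)malloc(c.count * sizeof(Matrix));
for (int i = 0; i < c.count; i++) {
    Matrix m;
    m.height = 10;  /* assumed 10 x 10 shape; the question only says 100 cells */
    m.width = 10;
    m.elements = (Cell*)malloc(100 * sizeof(Cell));
    c.tables[i] = m;
}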
That is, a Container of 20 Matrix elements, each holding 100 Cells.
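- How can I now copy this data to device memory using cudaMemcpy()?
- Is there a good way to perform a deep copy of a struct of structs from host to device?
Thanks for your time.
Andrea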
Recommended answer
The short answer is "just don't". There are four reasons why I say that:
- There is no deep copy functionality in the API.
- The code you would have to write to set up and copy the structure you have described to the GPU will be ridiculously complex (about 4000 API calls at a minimum, and probably an intermediate kernel, for your example of 20 Matrix elements of 100 Cells each).
- GPU code using three levels of pointer indirection will have massively increased memory access latency and will break what little cache coherency is available on the GPU.
- If you want to copy the data back to the host afterwards, you have the same problem in reverse.
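To make the bookkeeping behind these points concrete, here is a minimal sketch of what a manual deep copy of the question's structs might look like. The helper name deep_copy_to_device is hypothetical (it is not part of the CUDA API), and the sketch assumes every Matrix has height and width filled in so that height * width equals the number of Cells allocated for it.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

typedef struct { char data[128]; short length; } Cell;
typedef struct { Cell* elements; int height; int width; } Matrix;
typedef struct { Matrix* tables; int count; } Container;

// Hypothetical helper (not part of the CUDA API). On success, d_out->tables
// is a device pointer to `count` Matrix headers, and each header's `elements`
// field is itself a device pointer. Returns 0 on success, nonzero on failure.
// For brevity, device memory already allocated is not released on failure.
int deep_copy_to_device(const Container* h_c, Container* d_out)
{
    size_t n = (size_t)h_c->count;

    // Host-side staging copy of the headers, patched to hold device pointers.
    Matrix* staging = (Matrix*)malloc(n * sizeof(Matrix));
    if (staging == NULL) return 1;

    int status = 0;
    for (size_t i = 0; i < n && status == 0; i++) {
        const Matrix* src = &h_c->tables[i];
        // Assumption: height * width is the number of cells in this Matrix.
        size_t bytes = (size_t)src->height * src->width * sizeof(Cell);
        staging[i] = *src;
        staging[i].elements = NULL;
        // One device allocation and one copy per Matrix.
        if (cudaMalloc((void**)&staging[i].elements, bytes) != cudaSuccess ||
            cudaMemcpy(staging[i].elements, src->elements, bytes,
                       cudaMemcpyHostToDevice) != cudaSuccess) {
            status = 2;
        }
    }

    d_out->tables = NULL;
    d_out->count = h_c->count;
    // The patched headers can only be uploaded once every device pointer is known.
    if (status == 0 &&
        (cudaMalloc((void**)&d_out->tables, n * sizeof(Matrix)) != cudaSuccess ||
         cudaMemcpy(d_out->tables, staging, n * sizeof(Matrix),
                    cudaMemcpyHostToDevice) != cudaSuccess)) {
        status = 3;
    }

    free(staging);
    return status;
}
```

Copying results back means walking the same structure in the other direction: first fetch the Matrix headers from the device, then issue one cudaMemcpy per elements array. A kernel reading a Cell through this layout also has to dereference tables and then elements before it reaches the data.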
Consider using linear memory and indexing instead. It is portable between host and GPU, and the allocation and copy overhead is about 1% of that of the pointer-based alternative.
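As a rough illustration of the linear-memory approach, the sketch below stores every Cell in one contiguous buffer and computes offsets from (table, row, col). It assumes all 20 matrices share the same shape, taken here to be 10 x 10 since the question only says 100 cells each. The names cell_index, tag_cells and run_flat_copy_demo are made up for this example.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

typedef struct {
    char data[128];
    short length;
} Cell;

// Maps (table, row, col) to an offset in one contiguous Cell array.
// Usable on both host and device because it is plain integer arithmetic.
__host__ __device__ inline size_t cell_index(int table, int row, int col,
                                             int height, int width)
{
    return ((size_t)table * height + row) * width + col;
}

// Illustrative kernel: stores each cell's table index in its `length` field.
__global__ void tag_cells(Cell* cells, int count, int height, int width)
{
    size_t total = (size_t)count * height * width;
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < total) {
        cells[i].length = (short)(i / ((size_t)height * width));
    }
}

// Hypothetical demo: returns 0 on success, nonzero on failure.
int run_flat_copy_demo(void)
{
    const int count = 20, height = 10, width = 10;  // 10 x 10 is an assumed shape
    const size_t total = (size_t)count * height * width;
    const size_t bytes = total * sizeof(Cell);

    Cell* h_cells = (Cell*)calloc(total, sizeof(Cell));
    if (h_cells == NULL) return 1;

    Cell* d_cells = NULL;
    int status = 1;
    // One allocation and one copy move the whole data set to the device.
    if (cudaMalloc((void**)&d_cells, bytes) == cudaSuccess &&
        cudaMemcpy(d_cells, h_cells, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        const int threads = 256;
        const int blocks = (int)((total + threads - 1) / threads);
        tag_cells<<<blocks, threads>>>(d_cells, count, height, width);
        // The device-to-host copy waits for the kernel on the default stream.
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(h_cells, d_cells, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            // Cell (table 7, row 3, col 4) should now carry its table index.
            status = (h_cells[cell_index(7, 3, 4, height, width)].length == 7) ? 0 : 2;
        }
    }
    cudaFree(d_cells);  // cudaFree(NULL) is a harmless no-op
    free(h_cells);
    return status;
}
```

Calling run_flat_copy_demo() from main exercises the full round trip. If the matrices had different shapes, a small array of per-table offsets could be kept alongside the buffer, still without storing any pointers inside the data.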
If you really want to do this, leave a comment and I will try and dig up some old code examples which show what a complete folly nested pointers are on the GPU.