Problem Description
I wrote my sample code like this.
int** d_ptr;
cudaMalloc((void**)&d_ptr, sizeof(int*)*N);

int* tmp_ptr[N];
for (int i = 0; i < N; i++)
    cudaMalloc((void**)&tmp_ptr[i], sizeof(int)*SIZE);

cudaMemcpy(d_ptr, tmp_ptr, sizeof(tmp_ptr), cudaMemcpyHostToDevice);
This code works well, but after the kernel launch I can't retrieve the result.
int* Mtx_on_GPU[N];
cudaMemcpy(Mtx_on_GPU, d_ptr, sizeof(int)*N*SIZE, cudaMemcpyDeviceToHost);
At this point a segmentation fault occurs, but I don't know what I did wrong.
int* Mtx_on_GPU[N];
for (int i = 0; i < N; i++)
    cudaMemcpy(Mtx_on_GPU[i], d_ptr[i], sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
This code has the same error.
I'm sure my code has some mistake, but I couldn't find it all day.
Give me some advice.
Solution

In the last line
cudaMemcpy(Mtx_on_GPU[i], d_ptr[i], sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
you are trying to copy data from the device to the host. (NOTE: I assume that you allocated host memory for the Mtx_on_GPU pointers!) However, the pointers are stored in device memory, so you can't access them directly from the host side. The line should be
cudaMemcpy(Mtx_on_GPU[i], tmp_ptr[i], sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
This may become clearer when using "overly elaborate" variable names:
int** devicePointersStoredInDeviceMemory;
cudaMalloc((void**)&devicePointersStoredInDeviceMemory, sizeof(int*)*N);

int* devicePointersStoredInHostMemory[N];
for (int i = 0; i < N; i++)
    cudaMalloc((void**)&devicePointersStoredInHostMemory[i], sizeof(int)*SIZE);

cudaMemcpy(
    devicePointersStoredInDeviceMemory,
    devicePointersStoredInHostMemory,
    sizeof(int*)*N, cudaMemcpyHostToDevice);

// Invoke kernel here, passing "devicePointersStoredInDeviceMemory"
// as an argument
...

int* hostPointersStoredInHostMemory[N];
for (int i = 0; i < N; i++) {
    int* hostPointer = hostPointersStoredInHostMemory[i];
    // (allocate memory for hostPointer here!)
    int* devicePointer = devicePointersStoredInHostMemory[i];
    cudaMemcpy(hostPointer, devicePointer, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
}
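(A side note that is not part of the original answer: every CUDA runtime call returns an error code, and checking it makes it much easier to see which call fails. A minimal sketch of such a helper could look like the following; the macro name CHECK is only an illustration. The segmentation faults in the question happen on the host side, so a check like this does not replace getting the pointers right, it only reports errors that the CUDA runtime itself can detect.)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Usage, with the names from the snippet above:
// CHECK(cudaMalloc((void**)&devicePointersStoredInDeviceMemory, sizeof(int*)*N));
// CHECK(cudaMemcpy(hostPointer, devicePointer, sizeof(int)*SIZE, cudaMemcpyDeviceToHost));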
EDIT in response to the comment:
The d_ptr is "an array of pointers". But the memory of this array is allocated with cudaMalloc. That means that it is located on the device. In contrast to that, with

int* Mtx_on_GPU[N];

you are "allocating" N pointers in host memory. Instead of specifying the array size, you could also have used malloc. It may become clearer when you compare the following allocations:

int** pointersStoredInDeviceMemory;
cudaMalloc((void**)&pointersStoredInDeviceMemory, sizeof(int*)*N);

int** pointersStoredInHostMemory;
pointersStoredInHostMemory = (int**)malloc(N * sizeof(int*));

// This is not possible, because the array was allocated with cudaMalloc:
int* pointerA = pointersStoredInDeviceMemory[0];

// This is possible, because the array was allocated with malloc:
int* pointerB = pointersStoredInHostMemory[0];
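(For completeness, an addition that is not in the original answer: a pointer value stored in device memory can still be obtained on the host, but only by copying the pointer itself back with cudaMemcpy instead of indexing the array. A sketch, assuming pointersStoredInDeviceMemory has already been filled with valid device pointers as shown earlier:)

// Read back the first pointer value that is stored in device memory.
int* pointerA = NULL;
cudaMemcpy(&pointerA, pointersStoredInDeviceMemory, sizeof(int*),
           cudaMemcpyDeviceToHost);
// pointerA now holds a device address: it may be passed to cudaMemcpy or to a
// kernel, but it must still not be dereferenced on the host.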
It may be a little bit brain-twisting to keep track of
- the type of the memory where the pointers are stored
- the type of the memory that the pointers are pointing to
but fortunately, it hardly becomes more than 2 indirections.
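To tie the pieces together, here is a minimal end-to-end sketch of the pattern described above. The kernel fillKernel and the concrete values of N and SIZE are my own illustration and were not part of the original question:

#include <cstdio>
#include <cuda_runtime.h>

#define N    4
#define SIZE 8

// Hypothetical kernel: each block fills one of the N rows with its block index.
__global__ void fillKernel(int** rows)
{
    int i = threadIdx.x;
    if (i < SIZE)
        rows[blockIdx.x][i] = blockIdx.x;
}

int main()
{
    // Array of device pointers, stored in device memory.
    int** d_ptr = NULL;
    cudaMalloc((void**)&d_ptr, sizeof(int*) * N);

    // The same device pointers, kept in host memory for the later copy-back.
    int* tmp_ptr[N];
    for (int i = 0; i < N; i++)
        cudaMalloc((void**)&tmp_ptr[i], sizeof(int) * SIZE);

    cudaMemcpy(d_ptr, tmp_ptr, sizeof(tmp_ptr), cudaMemcpyHostToDevice);

    fillKernel<<<N, SIZE>>>(d_ptr);

    // Copy back row by row via the host-side copies of the device pointers.
    int result[N][SIZE];
    for (int i = 0; i < N; i++)
        cudaMemcpy(result[i], tmp_ptr[i], sizeof(int) * SIZE,
                   cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        printf("row %d starts with %d\n", i, result[i][0]);

    // Clean up.
    for (int i = 0; i < N; i++)
        cudaFree(tmp_ptr[i]);
    cudaFree(d_ptr);
    return 0;
}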