问题描述
我有一台配备NVIDIA GT750M 4Gb(计算能力3.0)显卡的笔记本电脑。我需要排序CUDA上的结构数组(约3×10 ^ 7个元素)。但我不知道如何,因为我没有足够的经验在CUDA。当使用 thrust :: sort 我得到奇怪的结果(需要几十分钟,而 std :: sort 1 分钟)。
I have a laptop with an NVIDIA GT750M 4Gb (compute capability 3.0) graphics card. I need to sort an array of structures on CUDA (about 3 × 10^7 elements). But I cannot figure out how, since I do not have enough experience in CUDA. When using thrust::sort I get strange results (it takes a few tens of minutes, while std::sort takes 1 minute).
struct MyStruct { float key; float a; float b; int c; int d; int e; int f; bool flag; } bool minCompare(const MyStruct lhs, const MyStruct rhs) { return lhs.key < rhs.key; }
推荐答案
Robert Crovella指出在他的评论中,分钟的帐篷很可能意味着你做错了。我在下面提供一个例子,其中我比较使用 thrust :: sort 和排序结构数组(AoS)和数组结构(SoA) code> thrust :: sort_by_key 。我在笔记本电脑上运行GeForce GT 540M和CUDA 5.5编译,所以你有一个比我更强大的卡。对于 100000 元素,两种情况下的执行时间都是秒的数量级。正如我在我的评论中指出的,第一种情况在计算时间( 1675ms )比第二种( 668.9ms )。
As Robert Crovella has pointed out in his comment, tents of minutes most likely means that you are doing something wrong. I'm providing an example below in which I compare the performance of sorting an Array of Structures (AoS) and a Structure of Arrays (SoA) using thrust::sort and thrust::sort_by_key. I'm running on a laptop GeForce GT 540M and compiling with CUDA 5.5, so you have a more powerful card than mine. For 100000 elements the execution time is of the order of seconds in both cases. As I pointed out in my comment, the first case is more demanding in terms of computation time (1675ms) than the second (668.9ms).
#include <thrust\device_vector.h> #include <thrust\sort.h> struct MyStruct1 { int key; int value1; int value2; }; struct MyStruct2 { int N; int* key; int* value1; int* value2; MyStruct2(int N_) { N = N_; cudaMalloc((void**)&key,N*sizeof(int)); cudaMalloc((void**)&value1,N*sizeof(int)); cudaMalloc((void**)&value2,N*sizeof(int)); } }; __host__ __device__ bool operator<(const MyStruct1 &lhs, const MyStruct1 &rhs) { return (lhs.key < rhs.key); }; void main(void) { const int N = 10000; float time; cudaEvent_t start, stop; /*******************************/ /* SORTING ARRAY OF STRUCTURES */ /*******************************/ thrust::host_vector<MyStruct1> h_struct1(N); for (int i = 0; i<N; i++) { MyStruct1 s; s.key = rand()*255; s.value1 = rand()*255; s.value2 = rand()*255; h_struct1[i] = s; } thrust::device_vector<MyStruct1> d_struct(h_struct1); cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0); thrust::sort(d_struct.begin(), d_struct.end()); cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf("Sorting array of structs - elapsed time: %3.1f ms \n", time); h_struct1 = d_struct; //for (int i = 0; i<N; i++) //{ // MyStruct1 s = h_struct1[i]; // printf("key %i value1 %i value2 %i\n",s.key,s.value1,s.value2); //} //printf("\n\n"); /*******************************/ /* SORTING STRUCTURES OF ARRAYS*/ /*******************************/ MyStruct2 d_struct2(N); thrust::host_vector<int> h_temp_key(N); thrust::host_vector<int> h_temp_value1(N); thrust::host_vector<int> h_temp_value2(N); //for (int i = 0; i<N; i++) //{ // h_temp_key[i] = rand()*255; // h_temp_value1[i] = rand()*255; // h_temp_value2[i] = rand()*255; // printf("Original data - key %i value1 %i value2 %i\n",h_temp_key[i],h_temp_value1[i],h_temp_value2[i]); //} //printf("\n\n"); cudaMemcpy(d_struct2.key,h_temp_key.data(),N*sizeof(int),cudaMemcpyHostToDevice); cudaMemcpy(d_struct2.value1,h_temp_value1.data(),N*sizeof(int),cudaMemcpyHostToDevice); cudaMemcpy(d_struct2.value2,h_temp_value2.data(),N*sizeof(int),cudaMemcpyHostToDevice); // wrap raw pointers with device pointers cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0); thrust::device_ptr<int> dev_ptr_key = thrust::device_pointer_cast(d_struct2.key); thrust::device_ptr<int> dev_ptr_value1 = thrust::device_pointer_cast(d_struct2.value1); thrust::device_ptr<int> dev_ptr_value2 = thrust::device_pointer_cast(d_struct2.value2); thrust::device_vector<int> d_indices(N); thrust::sequence(d_indices.begin(), d_indices.end(), 0, 1); // first sort the keys and indices by the keys thrust::sort_by_key(dev_ptr_key, dev_ptr_key + N, d_indices.begin()); // Now reorder the ID arrays using the sorted indices thrust::gather(d_indices.begin(), d_indices.end(), dev_ptr_value1, dev_ptr_value1); thrust::gather(d_indices.begin(), d_indices.end(), dev_ptr_value2, dev_ptr_value2); cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf("Sorting struct of arrays - elapsed time: %3.1f ms \n", time); cudaMemcpy(h_temp_key.data(),d_struct2.key,N*sizeof(int),cudaMemcpyDeviceToHost); cudaMemcpy(h_temp_value1.data(),d_struct2.value1,N*sizeof(int),cudaMemcpyDeviceToHost); cudaMemcpy(h_temp_value2.data(),d_struct2.value2,N*sizeof(int),cudaMemcpyDeviceToHost); //for (int i = 0; i<N; i++) printf("Ordered data - key %i value1 %i value2 %i\n",h_temp_key[i],h_temp_value1[i],h_temp_value2[i]); //printf("\n\n"); getchar(); }
为了简单起见,我略过了添加正确的CUDA错误检查在。
这篇关于排序CUDA中的结构数组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!