Question
I have a 5000x500 matrix and I want to sort each row separately with CUDA. I can use ArrayFire, but under the hood this is just a for loop over thrust::sort, which should not be efficient.
https://github.com/arrayfire/arrayfire/blob/devel/src/backend/cuda/kernel/sort.hpp
for(dim_type w = 0; w < val.dims[3]; w++) {
    dim_type valW = w * val.strides[3];
    for(dim_type z = 0; z < val.dims[2]; z++) {
        dim_type valWZ = valW + z * val.strides[2];
        for(dim_type y = 0; y < val.dims[1]; y++) {
            dim_type valOffset = valWZ + y * val.strides[1];
            if(isAscending) {
                thrust::sort(val_ptr + valOffset, val_ptr + valOffset + val.dims[0]);
            } else {
                thrust::sort(val_ptr + valOffset, val_ptr + valOffset + val.dims[0],
                             thrust::greater<T>());
            }
        }
    }
}
Is there a way to fuse operations in thrust so as to have the sorts run in parallel? Indeed, what I am looking for is a generic way to fuse for-loop iterations.
Answer
I can think of 2 possibilities, one of which is suggested already by @JaredHoberock. I don't know of a general methodology to fuse for-loop iterations in thrust, but the second method is the more general approach. My guess is that the first method would be the faster of the two approaches, in this case.
Use a vectorized sort. If the regions to be sorted by your nested for-loops are non-overlapping, you can do a vectorized sort using 2 back-to-back stable-sort operations, as discussed here.
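A minimal sketch of that idea for a row-major rows x cols matrix (the sort_rows helper and the key generation via counting_iterator and placeholders are illustrative choices of mine, not taken from the linked discussion):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>

// hypothetical helper: sorts every row of a row-major rows x cols matrix
void sort_rows(thrust::device_vector<int> &d_data, int rows, int cols)
{
    // key for each element = its row index
    thrust::device_vector<int> d_keys(rows * cols);
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(rows * cols),
                      d_keys.begin(),
                      thrust::placeholders::_1 / cols);
    // pass 1: sort all values globally, carrying the row keys along
    thrust::stable_sort_by_key(d_data.begin(), d_data.end(), d_keys.begin());
    // pass 2: stable-sort by row key; stability preserves the per-row ordering
    thrust::stable_sort_by_key(d_keys.begin(), d_keys.end(), d_data.begin());
}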
Thrust v1.8 (available with CUDA 7 RC, or via direct download from the thrust github repository) includes support for nesting thrust algorithms, by including a thrust algorithm call within a custom functor passed to another thrust algorithm. If you use the thrust::for_each operation to select the individual sorts you need to perform, you can run those sorts with a single thrust algorithm call, by including the thrust::sort operation in the functor you pass to thrust::for_each.
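Condensed, the nesting pattern looks like this (row_sort is just an illustrative name; the full benchmark below does the same thing with its sort_functor):

#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

struct row_sort
{
    thrust::device_ptr<int> data;  // base pointer of the full matrix
    int cols;                      // elements per row
    __host__ __device__
    void operator()(int row)
    {
        // a thrust algorithm nested inside a functor (Thrust v1.8 feature)
        thrust::sort(thrust::device, data + row*cols, data + (row+1)*cols);
    }
};

// usage sketch: one index per row, one for_each call launches all the sorts
// thrust::device_vector<int> row_idx(rows);
// thrust::sequence(row_idx.begin(), row_idx.end());
// row_sort f = {d_matrix.data(), cols};
// thrust::for_each(row_idx.begin(), row_idx.end(), f);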
Here's a fully worked comparison between 3 methods:
- the original loop sort method
- the vectorized / batch sort
- the nested sort
In each case, we're sorting the same 16000 sets of 1000 ints each.
$ cat t617.cu
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/generate.h>
#include <thrust/equal.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <iostream>
#include <stdlib.h>

#define NSORTS 16000
#define DSIZE 1000

int my_mod_start = 0;
int my_mod(){
  return (my_mod_start++)/DSIZE;
}

bool validate(thrust::device_vector<int> &d1, thrust::device_vector<int> &d2){
  return thrust::equal(d1.begin(), d1.end(), d2.begin());
}

struct sort_functor
{
  thrust::device_ptr<int> data;
  int dsize;
  __host__ __device__
  void operator()(int start_idx)
  {
    thrust::sort(thrust::device, data+(dsize*start_idx), data+(dsize*(start_idx+1)));
  }
};

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

int main(){
  cudaDeviceSetLimit(cudaLimitMallocHeapSize, (16*DSIZE*NSORTS));
  thrust::host_vector<int> h_data(DSIZE*NSORTS);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  thrust::device_vector<int> d_data = h_data;

  // first time a loop
  thrust::device_vector<int> d_result1 = d_data;
  thrust::device_ptr<int> r1ptr = thrust::device_pointer_cast<int>(d_result1.data());
  unsigned long long mytime = dtime_usec(0);
  for (int i = 0; i < NSORTS; i++)
    thrust::sort(r1ptr+(i*DSIZE), r1ptr+((i+1)*DSIZE));
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "loop time: " << mytime/(float)USECPSEC << "s" << std::endl;

  //vectorized sort
  thrust::device_vector<int> d_result2 = d_data;
  thrust::host_vector<int> h_segments(DSIZE*NSORTS);
  thrust::generate(h_segments.begin(), h_segments.end(), my_mod);
  thrust::device_vector<int> d_segments = h_segments;
  mytime = dtime_usec(0);
  thrust::stable_sort_by_key(d_result2.begin(), d_result2.end(), d_segments.begin());
  thrust::stable_sort_by_key(d_segments.begin(), d_segments.end(), d_result2.begin());
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "vectorized time: " << mytime/(float)USECPSEC << "s" << std::endl;
  if (!validate(d_result1, d_result2)) std::cout << "mismatch 1!" << std::endl;

  //nested sort
  thrust::device_vector<int> d_result3 = d_data;
  sort_functor f = {d_result3.data(), DSIZE};
  thrust::device_vector<int> idxs(NSORTS);
  thrust::sequence(idxs.begin(), idxs.end());
  mytime = dtime_usec(0);
  thrust::for_each(idxs.begin(), idxs.end(), f);
  cudaDeviceSynchronize();
  mytime = dtime_usec(mytime);
  std::cout << "nested time: " << mytime/(float)USECPSEC << "s" << std::endl;
  if (!validate(d_result1, d_result3)) std::cout << "mismatch 2!" << std::endl;
  return 0;
}
$ nvcc -arch=sm_20 -std=c++11 -o t617 t617.cu
$ ./t617
loop time: 8.51577s
vectorized time: 0.068802s
nested time: 0.567959s
$
Notes:
- These results will vary significantly from GPU to GPU.
- The "nested" time/method may vary significantly on a GPU that can support dynamic parallelism, as this will affect how thrust runs the nested sort functions. To test with dynamic parallelism, change the compile switches from -arch=sm_20 to -arch=sm_35 -rdc=true -lcudadevrt
- This code requires CUDA 7 RC. I used Fedora 20.
- The nested sort method also will allocate from the device side, therefore we must substantially increase the device allocation heap using cudaDeviceSetLimit.
- If you are using dynamic parallelism, and depending on the type of GPU you are running on, the amount of memory reserved with cudaDeviceSetLimit may need to be increased, perhaps by an additional factor of 8 (see the sketch after this list).