




I'm a newbie to GPU programming. Recently, I have been trying to implement the GPU BVH construction algorithm based on a tutorial: http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/. In the first step of this algorithm, the Morton code (unsigned int) of every primitive is computed and sorted. The tutorial gives reference timings for computing and sorting the Morton codes of 12K objects:

  1. 0.02 ms, one thread per object: calculate the bounding box.

  2. 0.18 ms, parallel radix sort: sort the objects according to their Morton codes.
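For context, the Morton code computation mentioned above can be sketched as follows. This is a minimal host-side C++ sketch of the common 30-bit bit-interleaving approach, assuming each coordinate has already been normalized to [0, 1]; the names `expandBits` and `morton3D` are illustrative, and in a CUDA kernel the same functions would typically be marked `__device__`.

```cpp
#include <algorithm>
#include <cstdint>

// Spread the lower 10 bits of v so that two zero bits follow each original bit.
inline std::uint32_t expandBits(std::uint32_t v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a 3D point whose coordinates lie in [0, 1].
// Assumption: the caller has normalized the point to the scene bounds.
inline std::uint32_t morton3D(float x, float y, float z)
{
    // Quantize each coordinate to 10 bits (0..1023), clamping out-of-range input.
    x = std::min(std::max(x * 1024.0f, 0.0f), 1023.0f);
    y = std::min(std::max(y * 1024.0f, 0.0f), 1023.0f);
    z = std::min(std::max(z * 1024.0f, 0.0f), 1023.0f);
    const std::uint32_t xx = expandBits(static_cast<std::uint32_t>(x));
    const std::uint32_t yy = expandBits(static_cast<std::uint32_t>(y));
    const std::uint32_t zz = expandBits(static_cast<std::uint32_t>(z));
    // Interleave as x, y, z from the most significant bit of each triple.
    return xx * 4 + yy * 2 + zz;
}
```

Each object gets one such key (for example, from the centroid of its bounding box), which is why this step maps naturally to one thread per object.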


In my implementation, the first step costs 0.1 ms and the sorting step costs 1.8 ms. I'm using Thrust to do the sorting. So what is the fastest implementation of radix sort on the GPU?
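To make the sorting step concrete, here is a sequential CPU sketch of least-significant-digit radix sort on 32-bit keys. It is only an illustration of the technique, not Thrust's implementation or the tutorial's; the function name `radixSort` and the choice of 8-bit digits are assumptions. GPU versions parallelize the same three phases (histogram, prefix sum, scatter) within each pass.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort 32-bit keys in ascending order using four stable 8-bit passes.
inline void radixSort(std::vector<std::uint32_t>& keys)
{
    std::vector<std::uint32_t> buffer(keys.size());
    for (int shift = 0; shift < 32; shift += 8) {
        // Histogram of the current digit, stored one slot to the right.
        std::array<std::size_t, 257> offsets{};
        for (std::uint32_t k : keys)
            ++offsets[((k >> shift) & 0xFFu) + 1];
        // Prefix sum: offsets[d] becomes the first output index for digit d.
        for (std::size_t d = 0; d < 256; ++d)
            offsets[d + 1] += offsets[d];
        // Stable scatter into the buffer, then swap for the next pass.
        for (std::uint32_t k : keys)
            buffer[offsets[(k >> shift) & 0xFFu]++] = k;
        keys.swap(buffer);
    }
}
```

The number of passes depends only on the key width, not on the input size, so a 30-bit Morton code needs the same four passes here whether there are 10 keys or 12K.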


I'm using a GeForce Titan GPU, which should be faster than the GeForce GTX 690 used by the author of the tutorial. Here is my test code for sorting; it takes about 1.5 ms even when the size is 10.

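#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_free.h>
#include <thrust/device_malloc.h>
#include <thrust/device_ptr.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>

void testSort()
{
    int sz = 10;
    thrust::host_vector<unsigned int> h_keys(sz);
    for(int i=0; i<sz; i++)
        h_keys[i] = rand();
    // Allocate device storage and upload the random keys.
    thrust::device_ptr<unsigned int> keys = thrust::device_malloc<unsigned int>(sz);
    thrust::copy(h_keys.begin(), h_keys.end(), keys);
    cudaEvent_t estart, estop;
    cudaEventCreate( &estart );
    cudaEventCreate( &estop );
    // Time only the sort itself.
    cudaEventRecord( estart, 0 );
    thrust::sort(keys, keys + sz);
    cudaEventRecord( estop, 0 );
    cudaEventSynchronize( estop );
    float elapsedTime;
    cudaEventElapsedTime( &elapsedTime, estart, estop );
    printf( "Time to sort: %3.1f ms\n", elapsedTime );
    cudaEventDestroy( estart );
    cudaEventDestroy( estop );
    thrust::device_free(keys);
}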
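
One thing worth checking with a measurement like this: a single timed call may include one-time or per-call fixed costs (for example, first-use initialization or temporary allocations inside the library), which would dominate at a size of 10. Whether that explains the 1.5 ms here is an assumption, not something the post establishes. A minimal host-side sketch that separates a warm-up call from the averaged steady-state time might look like this; `averageMs` is a hypothetical helper name.

```cpp
#include <chrono>

// Call `work` once untimed as a warm-up, then return the mean wall-clock
// time in milliseconds over `iterations` further calls.
template <typename F>
double averageMs(F&& work, int iterations)
{
    work();  // warm-up: absorbs one-time setup costs
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count() / iterations;
}
```

When timing GPU work this way, the callable should block until the device has finished (for example, by calling `cudaDeviceSynchronize()` before returning), because kernel launches are asynchronous with respect to the host. It also needs a fresh copy of the unsorted keys on each call, otherwise later iterations sort already-sorted data.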



There is a radix sort implementation for GPGPU by back40computing. They provide a performance comparison chart and claim that their implementation is the fastest.

