2D cumulative sum of an array in nested loops -- CUDA implementation?

I have been thinking of how to perform this operation on CUDA using reductions, but I'm a bit at a loss as to how to accomplish it. The C code is below. The important part to keep in mind: the variable precalculatedValue depends on both loop iterators. Also, the variable ngo is not unique to every value of m; e.g. m = 0,1,2 might have ngo = 1, whereas m = 4,5,6,7,8 could have ngo = 2, etc. I have included the sizes of the loop iterators in case it helps to provide better implementation suggestions.

```c
// macro that translates 2D [i][j] array indices to 1D flattened array indices
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )

int Nobs = 60480;
int NgS  = 1859;
int NgO  = 900;

// ngo goes from [1,900]
// rInd is initialized (and filled earlier) as:
// rInd = new long int [Nobs];

for (m = 0; m < Nobs; m++) {
    ngo = rInd[m] - 1;
    for (n = 0; n < NgS; n++) {
        Aggregation[idx(n, ngo, NgO)] += precalculatedValue;
    }
}
```

In a previous case, when precalculatedValue was only a function of the inner loop variable, I saved the values in unique array indices and added them with a parallel reduction (Thrust) after the fact. However, this case has me stumped: the values of m are not uniquely mapped to the values of ngo. Thus, I don't see a way of making this code efficient (or even workable) with a reduction. Any ideas are most welcome.

Solution

Just a stab... I suspect that transposing your loops might help.

```c
for (n = 0; n < NgS; n++) {
    for (m = 0; m < Nobs; m++) {
        ngo = rInd[m] - 1;
        Aggregation[idx(n, ngo, NgO)] += precalculatedValue(m, n);
    }
}
```

The reason I did this is that idx varies more rapidly with ngo (a function of m) than with n, so making m the inner loop improves coherence. Note that I also made precalculatedValue a function of (m, n), because you said that it is -- this makes the pseudocode clearer.

Then you could start by leaving the outer loop on the host and writing a kernel for the inner loop (60,480-way parallelism is enough to fill most current GPUs).

In the inner loop, just start by using atomicAdd() to handle collisions. If they are infrequent, performance on Fermi GPUs shouldn't be too bad. In any case, you will be bandwidth bound, since the arithmetic intensity of this computation is low. So once this is working, measure the bandwidth you are achieving and compare it to the peak for your GPU. If you are way off, then think about optimizing further (perhaps by parallelizing the outer loop -- one iteration per thread block -- and doing the inner loop using some shared memory and thread-cooperation optimizations).

The key: start simple, measure performance, and then decide how to optimize.

Note that this calculation looks very similar to a histogram calculation, which faces similar challenges, so you might want to google for GPU histograms to see how they have been implemented.

One idea is to sort (rInd[m], m) pairs using thrust::sort_by_key(); then, since the rInd duplicates will be grouped together, you can iterate over them and do your reductions without collisions. (This is one way to do histograms.) You could even do this with thrust::reduce_by_key().
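A minimal sketch of the "outer loop on the host, atomicAdd() kernel for the inner loop" starting point described above. The kernel name, launch configuration, and the precalculatedValue(m, n) device function are illustrative assumptions, not part of the original post; float accumulation is assumed because double-precision atomicAdd() is not available on Fermi-class hardware.

```cuda
// Hypothetical sketch: one launch per value of n, one thread per m.
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )

// Stand-in for however precalculatedValue is really computed (assumption).
__device__ float precalculatedValue(int m, int n) { return 1.0f; }

__global__ void aggregateRow(float *Aggregation, const long int *rInd,
                             int n, int Nobs, int NgO)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m < Nobs) {
        int ngo = (int)rInd[m] - 1;                 // several m can share one ngo...
        atomicAdd(&Aggregation[idx(n, ngo, NgO)],   // ...so accumulate atomically
                  precalculatedValue(m, n));
    }
}

// Host side: keep the outer loop over n on the CPU for now.
// for (int n = 0; n < NgS; n++)
//     aggregateRow<<<(Nobs + 255) / 256, 256>>>(d_Aggregation, d_rInd, n, Nobs, NgO);
```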
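And a sketch of the Thrust idea from the last paragraph, shown for a single fixed n: sort the per-m contributions by ngo with thrust::sort_by_key(), then collapse the now-adjacent duplicates with thrust::reduce_by_key(), so no atomics are needed. The function name and the assumption that vals already holds precalculatedValue(m, n) for each m are illustrative.

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// keys[m] = rInd[m] - 1 (i.e. ngo), vals[m] = precalculatedValue(m, n) for one fixed n.
// Returns the number of distinct ngo values; outKeys/outVals hold the per-ngo sums,
// which can then be added into row n of Aggregation (e.g. with a small scatter kernel).
size_t reduceRowByKey(thrust::device_vector<int>&   keys,     // length Nobs (modified)
                      thrust::device_vector<float>& vals,     // length Nobs (modified)
                      thrust::device_vector<int>&   outKeys,  // length >= Nobs
                      thrust::device_vector<float>& outVals)  // length >= Nobs
{
    // Group equal ngo values together, carrying the contributions along.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // Sum each run of equal keys; duplicates are adjacent, so no collisions occur.
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      outKeys.begin(), outVals.begin());

    return ends.first - outKeys.begin();
}
```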