Reduce register spilling?

Question
Hello, I've been writing my program in C++ AMP, and while my code is a lot faster than the first serial version, it is at best only 2x faster than using WARP on an ATI Radeon 5870 (mobile).
The problem is to solve (L+6) equations (L = 1, 2, ..., 8) many times for different initial data. The main computational effort is spent repeatedly calling this function:
//performs an RKCK step for mode functions
template<int L>
void RKCK(array_view<float_2, 3> Parameters, array_view<float_2, 2> N_array,
          array_view<float, 2> K_array, amp_extent_2 &Traj_Modes, float TEMP, int Stage)
{
    //loop over all trajectories
    parallel_for_each(Traj_Modes, [=](index<2> TrajMode) restrict(amp) {
        float N = N_array[TrajMode].x;
        float K = K_array[TrajMode];
        float H = Parameters[TrajMode[0]][TrajMode[1]][L].x;
        if(K > TEMP * concurrency::fast_math::expf(N) * H && N > -0.9f){
            float_2 Param_old[L+6];
            float dN = N_array[TrajMode].y;
            for(int l = 0; l < L+6; l++) Param_old[l] = Parameters[TrajMode[0]][TrajMode[1]][l];
            try_a_step<L>(Param_old, K, N, dN, TEMP, Stage); //long calculation...
            for(int l = 0; l < L+6; l++) Parameters[TrajMode[0]][TrajMode[1]][l] = Param_old[l];
            N_array[TrajMode].x = N;
            N_array[TrajMode].y = dN;
        }
    });
}
The process is similar to the NBody sample, where N = time, dN = deltatime, and Param_old[L+6] would contain the particle's position, velocity, etc.
I suspect one main problem is caused by the number of local variables I need to create. In addition to those above, try_a_step creates the following:
float_2 Param_new[L+6]
float_2 Param_error[L+6]
float_2 ak[6*(L+6)]
So in the simplest case there are going to be at least 100 floats for every thread, which means they are probably spilling to global memory?
Will this have a greater impact than, say, rewriting Parameters as 2 different array_view&lt;float&gt;s?
Is there much else I can do, apart from splitting it up into several parallel_for_each calls?
Answer