Reducing register spilling?

Problem description


Hello, I've been writing my program in C++ AMP, and while my code is a lot faster than the first serial version, it is at best only 2x faster than using WARP, running on an ATI Radeon 5870 (mobile).


The problem is to solve (L+6) equations (L = 1,2,...,8) lots of times for different initial data. The main computational effort is spent repeatedly calling this function:

//performs an RKCK step for mode functions
//(amp_extent_2 appears to be the question author's typedef for concurrency::extent<2>)
template<int L> void RKCK(array_view<float_2, 3> Parameters, array_view<float_2, 2> N_array,
                          array_view<float, 2> K_array, amp_extent_2 &Traj_Modes, float TEMP, int Stage)
{
	//loop over all trajectories
	parallel_for_each(Traj_Modes, [=](index<2> TrajMode) restrict(amp){

		float N = N_array[TrajMode].x;
		float K = K_array[TrajMode];
		float H = Parameters[TrajMode[0]][TrajMode[1]][L].x;

		if(K > TEMP*expf(N)*H && N > -0.9f){

			float_2 Param_old[L+6];
			float dN = N_array[TrajMode].y;

			for(int l = 0; l < L+6; l++)
				Param_old[l] = Parameters[TrajMode[0]][TrajMode[1]][l];

			try_a_step<L>(Param_old, K, N, dN, TEMP, Stage);	//long calculation...

			for(int l = 0; l < L+6; l++)
				Parameters[TrajMode[0]][TrajMode[1]][l] = Param_old[l];
			N_array[TrajMode].x = N;
			N_array[TrajMode].y = dN;
		}
	});
}



The process is similar to the NBody sample, where N = time, dN = deltatime, and Param_old[L+6] would contain the particle's position, velocity, etc.


I suspect one main problem is caused by the number of local variables I need to create. In addition to those above, the function try_a_step creates the following:

float_2 Param_new[L+6]

float_2 Param_error[L+6]

float_2 ak[6*(L+6)]


So in the simplest case there are going to be at least 100 floats for every thread, which means they are probably spilling to global memory?


Will this have a greater impact than, say, rewriting Parameters as 2 different array_view<float> objects?


Is there much else I can do, apart from splitting it up into several parallel_for_each calls?


