Problem description
To obtain a summary value (such as a sum or a mean) from per-work-item results like those in the following code:
int k = get_global_id(0);
double result=d[k]*d[k];
a reduction must be used, which is very difficult to implement and reduces code clarity, as described in the following link:
http://developer.amd.com/resources/articles-whitepapers/opencl-optimization-case-study-simple-reductions/
So it would be an advantage if OpenCL had a command to do this (it would also be an advantage over CUDA).
What I have tried:
This is the smallest working code I wrote to perform the reduction:
if (k == 0)
{
    int size = HEIGHT;
    while (size > 1)
    {
        barrier(CLK_GLOBAL_MEM_FENCE); // to give time to all rms[k] of the level be filled
        rms[size] = 0.0f;
        size = (1 + size) / 2;
        rms[k] += rms[k + size];
    }
}
barrier(CLK_GLOBAL_MEM_FENCE); // to give time to all rms[k] be filled
media = rms[0] / (float) WIDTH / (float) HEIGHT;
Recommended answer
work_group_reduce_add()
This command adds the elements of all participating threads (the work-items of one work-group) into a single value and broadcasts that value to all of them.
Its execution time should be comparable to an optimized custom algorithm, and it can get faster automatically, depending on the hardware vendor's implementation of these functions.
You could say "work-group scope alone is not enough", but the whole point of OpenCL and CUDA is breaking work into smaller pieces and computing them in parallel. You could use these functions to compute a global sum reduction with good enough performance.
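To make the "smaller pieces" idea concrete, here is a minimal sketch in plain C, run on the host, of a two-stage sum: one partial sum per group, then a sum over the partials. It only models the arithmetic; it is not OpenCL code. The names group_sum_of_squares and mean_of_squares are hypothetical, and the way the partial sums are combined (here a simple loop) is an assumption, since the answer does not say how to do it. In an OpenCL 2.0 kernel, the inner loop would be replaced by a single call such as work_group_reduce_add(d[k] * d[k]) executed by every work-item of the group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one work-group: sums the squares of the
   `count` elements that the group's work-items would each handle.
   In a kernel, work_group_reduce_add() would produce this value. */
static double group_sum_of_squares(const double *d, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; ++i)
        sum += d[i] * d[i];
    return sum;
}

/* Two-stage reduction: stage 1 yields one partial sum per group,
   stage 2 adds the partial sums. Returns the mean of the squares,
   or 0.0 for an empty input. */
double mean_of_squares(const double *d, size_t n, size_t group_size)
{
    assert(group_size > 0); /* a group must hold at least one element */

    double total = 0.0;
    for (size_t start = 0; start < n; start += group_size) {
        /* The last group may be only partially filled. */
        size_t remaining = n - start;
        size_t count = remaining < group_size ? remaining : group_size;
        total += group_sum_of_squares(d + start, count);
    }
    return n ? total / (double)n : 0.0;
}
```

Calling mean_of_squares(d, n, 256) would model a launch with 256 work-items per group. On a real device the second stage might instead be a second kernel pass, an atomic add, or a short host-side loop over the per-group results.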