Problem description

I'm pretty new to GPGPU programming. I'm trying to implement an algorithm that needs a lot of synchronization, so it uses only one work-group (the global and local sizes have the same value).

I have the following problem: my program works correctly until the problem size exceeds 32.

__kernel void assort(
    __global float *array,
    __local float *currentOutput,
    __local float *stimulations,
    __local int *noOfValuesAdded,
    __local float *addedValue,
    __local float *positionToInsert,
    __local int *activatedIdx,
    __local float *range,
    int size,
    __global float *stimulationsOut
)
{
    int id = get_local_id(0);
    if (id == 0) {...}

    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 2; i < size; i++)
    {
        int maxIdx;
        if (id == 0)
        {
            addedValue[0] = array[i];
            {...}
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        if (id < noOfValuesAdded[0]) {...}
        else
            barrier(CLK_LOCAL_MEM_FENCE);
        barrier(CLK_LOCAL_MEM_FENCE);
        if (activatedIdx[0] == -2) {...}
        else {...}

        barrier(CLK_LOCAL_MEM_FENCE);
        if (positionToInsert[0] != -1.0f) {...}

        barrier(CLK_LOCAL_MEM_FENCE);
        stimulationsOut[id] = addedValue[0];
        return;
    }

After some investigation I realized (by inspecting stimulationsOut) that addedValue[0] has a different value starting from the 33rd instance of the kernel, and yet another value starting from the 65th (so it looks like [123 123 123 ... 123 (33rd element) 66 ... 66 66 66 66 66 .. (65th element) 127 ... .. 127 ...])

__global float *array is READ_ONLY, and I do not change addedValue[0] anywhere except in the first if inside the for loop. What could cause this issue?

My GPU specs: [ https://devtalk.nvidia.com/default/topic/521502/gt650m-a-kepler-part-/]

After commenting out the bodies of two ifs, the problem no longer occurs:

    /*if (activatedIdx[0] == -2)
    {
        if (noOfValuesAdded[0] == 2)
        {
            positionToInsert[0] = 0.99f;
        }
        else if (id != 0 && id != maxIdx
                 && stimulations[id] >= stimulations[(id - 1)]
                 && stimulations[id] >= stimulations[(id + 1)])
        {
            if ((1.0f - (fabs((currentOutput[(id - 1)] - currentOutput[id])) / range[0])) < stimulations[(id - 1)])
                positionToInsert[0] = (float)id - 0.01f;
            else
                positionToInsert[0] = (float)id + 0.99f;
        }
    }*/

    if (positionToInsert[0] != -1.0f)
    {
        float temp = 0.0f;
        /*if ((float)id>positionToInsert[0])
        {
            temp = currentOutput[id];
            barrier(CLK_LOCAL_MEM_FENCE);
            currentOutput[id + 1] = temp;
        }
        else
        {
            barrier(CLK_LOCAL_MEM_FENCE);
        }*/
        barrier(CLK_LOCAL_MEM_FENCE);

        if (id == round(positionToInsert[0]))
        {
            currentOutput[id] = addedValue[0];
            noOfValuesAdded[0] = noOfValuesAdded[0] + 1;
        }
    }

Update: After fixing the barriers, the algorithm works properly until the size exceeds 768 (which is, oddly, twice the number of cores on my GPU). I expected it to work for up to 1024 elements, which is the maximum work-group size. Am I missing something?

Recommended answer

All work-items in a warp execute the same instruction in lock-step. The warp size on Nvidia is 32 work-items. If the kernel works correctly for up to 32 work-items, this suggests that something is wrong with the barriers.
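The warp boundaries line up with the values reported in the question. The following is a minimal sketch in plain C (not OpenCL), assuming work-items of a one-dimensional work-group are packed into warps in order of local id; `warp_index` is a hypothetical helper introduced only for illustration.

```c
#include <assert.h>

/* Hypothetical helper: index of the warp that holds a given work-item,
   assuming consecutive local ids are grouped into warps of warp_size. */
int warp_index(int local_id, int warp_size)
{
    assert(local_id >= 0 && warp_size > 0);
    return local_id / warp_size;
}
```

With a warp size of 32, local ids 0 to 31 fall in warp 0, ids 32 to 63 in warp 1, and id 64 starts warp 2. The 33rd and 65th elements of stimulationsOut (ids 32 and 64) are therefore the first work-items of a new warp, which is where the observed values change.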

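The documentation for barrier says that all work-items in a work-group must execute the barrier before any of them may continue past it, and that if the barrier is inside a conditional statement, all work-items must enter the conditional if any work-item does.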

I can see this being the issue in your kernel. For example here:

if ((float)id>positionToInsert[0])
{
    temp = currentOutput[id];
    barrier(CLK_LOCAL_MEM_FENCE); // <---- some work items may meet here
    currentOutput[id + 1] = temp;
}
else
{
    barrier(CLK_LOCAL_MEM_FENCE); // <---- other work items may meet here
}

You could probably fix this by:

if ((float)id>positionToInsert[0])
    temp = currentOutput[id];
barrier(CLK_LOCAL_MEM_FENCE); // <---- here all work items meet at the same barrier
if ((float)id>positionToInsert[0])
    currentOutput[id + 1] = temp;
