Using a struct as a buffer holder

Question

In my current OpenCL implementation, I wanted to save time on arguments: avoid passing buffers every time I want to use one inside a kernel, and keep the argument lists of my kernels short.

So I made a structure (workspace) that holds pointers to the buffers in device memory. The struct acts like an object with member variables that you want to access over time and keep alive for the whole execution. I never had a problem on AMD GPUs or even on CPU, but Nvidia causes a lot of problems with this. It always seems to be an alignment problem, never reaching the right buffer, etc.

Here is some code to help; see the question below:

Structure defined on the host:

 #define SRC_IMG 0       // (float4 buffer) Source image
 #define LAB_IMG 1       // (float4 buffer) LAB image

 // NOTE: The size of this array should be as much as the last define + 1.
 #define __WRKSPC_SIZE__ 2

 // Structure defined on host.
 struct Workspace
 {
      cl_ulong getPtr[__WRKSPC_SIZE__];
 };

 struct HostWorkspace
 {
      cl::Buffer srcImg;
      cl::Buffer labImg;
 };

Structure defined on the device:

typedef struct __attribute__(( packed )) gpuWorkspace
{
    ulong getPtr[__WRKSPC_SIZE__];
} gpuWorkspace_t;

Note that on the device I use ulong and on the host I use cl_ulong, as shown in "OpenCL: using struct as kernel argument".

So once the cl::Buffer objects for the source image and the LAB image are created, I save them into a HostWorkspace object. Until that object is released, the references to the cl::Buffer objects are kept, so the buffers exist for the entire project on the host, and de facto on the device.

Now I need to feed those to the device, so I have a simple kernel that initializes my device workspace as follows:

__kernel void Workspace_Init(__global gpuWorkspace_t* wrkspc,
                             __global float4* src,
                             __global float4* LAB)
{
    // Get the ulong pointer on the first element of each buffer.
    wrkspc->getPtr[SRC_IMG] = &src[0];
    wrkspc->getPtr[LAB_IMG] = &LAB[0];
}

where wrkspc is a buffer allocated with struct Workspace, and src and LAB are just buffers allocated as 1D array images.

Afterwards, in any of my kernels, if I want to use src or LAB, I do as follows:

__kernel void ComputeLABFromSrc(__global gpuWorkspace_t* wrkSpc)
{
    // =============================================================
    // Get pointer from work space.
    // =============================================================

    // Cast back the pointer of first element as a normal buffer you
    // want to use along the execution of the kernel.
    __global float4* srcData = ( __global float4* )( wrkSpc->getPtr[SRC_IMG] );
    __global float4* labData = ( __global float4* )( wrkSpc->getPtr[LAB_IMG] );

    // Code kernel as usual.
}

When I started to use this, I had about 4-5 images and it was going well, with a different structure like this:

struct Workspace
{
    cl_ulong imgPtr;
    cl_ulong labPtr;
};

where each image had its own pointer.

At a certain point I reached more images and had some problems. So I searched online and found a recommendation that the sizeof() of the struct could differ between device and host, so I changed it to a single array of the same type, and this worked fine up to 16 elements.

So I searched more and found a recommendation about __attribute__((packed)), which I put on the device structure (see above). But now I have reached 26 elements. When I check the sizeof of the struct on either the device or the host, the size is 208 (elements * sizeof(cl_ulong) == 26 * 8). But I still have an issue similar to the one with my previous model: my pointer ends up reading somewhere else, in the middle of the previous image, etc.
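The size check described here can also be expressed at compile time on the host. The following is only a minimal sketch: it assumes std::uint64_t as a stand-in for cl_ulong so that it compiles without the OpenCL headers, and the names host_ulong, kWorkspaceSize and workspaceBytes are hypothetical. Note that matching sizes only rule out a layout mismatch; they say nothing about whether the addresses stored in the struct remain valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Stand-in for cl_ulong (assumption: a 64-bit unsigned integer),
// used so the sketch builds without the OpenCL headers.
using host_ulong = std::uint64_t;

// Hypothetical element count, matching the 26 entries mentioned above.
constexpr std::size_t kWorkspaceSize = 26;

struct Workspace
{
    host_ulong getPtr[kWorkspaceSize];
};

// An array of 8-byte integers needs no padding, so the size is exactly
// element count times element size.
static_assert(sizeof(Workspace) == kWorkspaceSize * sizeof(host_ulong),
              "unexpected padding in Workspace");
static_assert(std::is_standard_layout<Workspace>::value,
              "Workspace should be standard layout before copying it to a buffer");

// Helper so the size can also be inspected at run time.
constexpr std::size_t workspaceBytes() { return sizeof(Workspace); }
```

With 26 entries, workspaceBytes() evaluates to 208, the same value reported above for both host and device.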

So I am wondering whether anyone has tried a similar model (maybe with a different approach) or has any tips for making this model "solid".

Note that all kernels are coded correctly: I get good results when executing the same code on AMD or on CPU. The only issue is on Nvidia.

Recommended answer

Don't try to store GPU-side pointer values across kernel boundaries. They are not guaranteed to stay the same. Always use indices. And if a kernel uses a specific buffer, you need to pass it as an argument to that kernel.

References:

  1. The OpenCL 1.2 specification (as far as I'm aware, Nvidia does not implement a newer standard) does not define the behaviour of pointer-to-integer casts or vice versa.
  2. Section 6.9p states: "Arguments to kernel functions that are declared to be a struct or union do not allow OpenCL objects to be passed as elements of the struct or union." This is exactly what you are attempting to do: passing a struct of buffers to a kernel.
  3. Section 6.9a states: "Arguments to kernel functions in a program cannot be declared as a pointer to a pointer(s)." This is essentially what you're trying to subvert by casting your pointers to an integer and back (see point 1). You can't "trick" OpenCL into making this well-defined by bypassing the type system.

As I suggest in the comment thread below, you will need to use indices to save positions inside a buffer object. If you want to store positions across different memory regions, you'll need to either unify multiple buffers into one and save one index into this giant buffer, or save a numeric value that identifies which buffer you are referring to.
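The first option (one unified buffer plus stored indices) might look roughly like the host-side sketch below. It is an illustration under assumptions, not the answerer's code: WorkspaceOffsets, makeOffsets and pixelAt are hypothetical names, every image is assumed to have the same element count, and plain float stands in for float4 to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical image slots, mirroring SRC_IMG / LAB_IMG from the question.
enum ImageSlot : std::uint32_t { SRC_IMG = 0, LAB_IMG = 1, SLOT_COUNT = 2 };

// Table of element offsets (not addresses) into one unified buffer.
// Offsets are relative to whatever base pointer a kernel receives as its
// argument, so they do not depend on where the buffer is placed.
struct WorkspaceOffsets
{
    std::uint32_t offset[SLOT_COUNT];
};

// Lay the images out back to back; each image holds pixelCount elements.
inline WorkspaceOffsets makeOffsets(std::uint32_t pixelCount)
{
    WorkspaceOffsets w{};
    for (std::uint32_t slot = 0; slot < SLOT_COUNT; ++slot)
        w.offset[slot] = slot * pixelCount;
    return w;
}

// Host-side stand-in for what a kernel would do: base + offset + index.
inline float& pixelAt(std::vector<float>& unified, const WorkspaceOffsets& w,
                      ImageSlot slot, std::uint32_t index)
{
    return unified[w.offset[slot] + index];
}
```

In an actual kernel, both the unified buffer and the offsets table would be passed as ordinary __global arguments, and the kernel would index the unified buffer with the stored offset plus its own work-item index instead of casting a stored integer back to a pointer.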

08-15 18:23