是否可以保证WaveFront(OpenCL)中的所有线程始终同步?

本文介绍了是否可以保证WaveFront(OpenCL)中的所有线程始终同步?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

众所周知，有WARP(在CUDA中)和WaveFront(在OpenCL中): http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf

As known, there are WARP (in CUDA) and WaveFront (in OpenCL): http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf

WARP : http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture

WARP in CUDA: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture

...

warp一次执行一条普通指令，因此完全有效当一个warp的所有32个线程都同意执行时，将实现小路.如果经线的线程通过依赖于数据的条件而发散分支，warp串行执行所采用的每个分支路径，从而禁用不在该路径上的线程，并且在所有路径完成后，线程收敛回到相同的执行路径.分支分歧仅在翘曲内发生； 不同的扭曲独立执行 不管他们执行的是通用代码还是脱节代码路径.

A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.

SIMT架构类似于SIMD(单指令，多指令) 数据)矢量组织，只需一条指令即可控制多个处理元素.一个关键的区别是SIMD向量组织将SIMD宽度公开给软件，而SIMT 指令指定单个的执行和分支行为线程.

The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread.

WaveFront : https://sites.google.com/site/csc8820/opencl-basics/opencl-terms-explained#TOC-Wavefront

WaveFront in OpenCL: https://sites.google.com/site/csc8820/opencl-basics/opencl-terms-explained#TOC-Wavefront

即我们知道:

I.e. we know, that:

WARP

个线程是SIMT线程，它们每次总是执行相同的指令，并且始终保持同步，即WARP与 SIMD通道(在CPU上)

threads in WARP (CUDA) - are SIMT-threads, which always executes the same instructions at each time and always are stay synchronized - i.e. threads of WARP are the same as lanes of SIMD (on CPU)

个线程是始终并行执行的线程，但不一定所有线程都执行完全相同的指令，并且不一定所有线程都同步

threads in WaveFront (OpenCL) - are threads, which always executes in parallel, but not necessarily all the threads perform the exact same instruction, and not necessarily all of the threads are synchronized

但是，是否可以保证WaveFront中的所有线程始终保持同步，例如WARP中的线程或SIMD中的通道?

But is there any guarantee that all of the threads in the WaveFront always synchronized such as threads in WARP or as lanes in SIMD?

结论:

WaveFront线程(项目)始终保持同步-锁定步骤:"wavefront在锁定步骤中相对于彼此执行许多工作项."
映射到SIMD块上的WaveFront :波前中的所有工作项都流向两个流控制路径"
即每个WaveFront线程(项目)映射到SIMD车道

WaveFront-threads (items) are always synchronized - lock step: "wavefront executes a number of work-items in lock step relative to each other."
WaveFront mapped on SIMD-block: "all work-items in the wavefront go to both paths of flow control"
I.e. each WaveFront-thread (item) mapped to SIMD-lanes

第20页: http://developer .amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

1.1术语

...

Wavefronts和工作组是与计算有关的两个概念提供数据并行粒度的内核. 波前相对于每个操作，在锁定步骤中执行许多工作项其他. 在矢量上并行执行16个工作项单位，整个波前覆盖了四个时钟周期.它是流量控制可以影响的最低级别.这意味着，如果波前的两个工作项会进入不同的流动路径控制，波前的所有工作项都进入两个流动路径控制.

Wavefronts and work-groups are two concepts relating to compute kernels that provide data-parallel granularity. A wavefront executes a number of work-items in lock step relative to each other. Sixteen workitems are execute in parallel across the vector unit, and the whole wavefront is covered over four clock cycles. It is the lowest level that flow control can affect. This means that if two work-items inside of a wavefront go divergent paths of flow control, all work-items in the wavefront go to both paths of flow control.

这适用于: http: //amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf

(第45页)第2章 GCN设备的OpenCL性能和优化
(第81页)第3章常绿和北部岛屿设备的OpenCL性能和优化

(page-45) Chapter 2 OpenCL Performance and Optimization for GCN Devices
(page-81) Chapter 3 OpenCL Performance and Optimization for Evergreen and Northern Islands Devices

推荐答案

首先，您可以查询一些值:

First, you can query some values:

CL_DEVICE_WAVEFRONT_WIDTH_AMD
CL_DEVICE_SIMD_WIDTH_AMD
CL_DEVICE_WARP_SIZE_NV
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

但仅从主机方面知道.

but only from host side as I know.

让我们假设这些查询返回了64，并且您的问题对线程的隐式同步很重要.

Lets assume these queries returned 64 and your question gives importance to threads' implicit synchronization.

如果有人选择本地范围= 4怎么办?

What if someone chooses local range = 4?

由于opencl从开发人员那里提取了硬件时钟，所以您不知道在运行时内核执行过程中实际的SIMD或WAVEFRONT大小是多少.

Since opencl abstracts hardware clockwork from developer, you can't know what actual SIMD or WAVEFRONT size is from within kernel execution in runtime.

例如，AMD NCU具有64个着色器，但它在同一计算单元内具有16宽SIMD，8宽SIMD，4宽SIMD，2宽SIMD甚至两个标量单元.

For example, AMD NCU has 64 shaders but it has 16-wide SIMD, 8-wide SIMD, 4-wide SIMD, 2-wide SIMD and even two scalar units inside same compute unit.

4个本地线程可以在两个标量和一个2宽单元或SIMD的任何其他组合上共享.内核代码不知道这一点.即使它知道某种计算方式，您也不知道在运行时在随机计算单元(64个着色器)中将哪个SIMD组合用于下一个内核执行(甚至下一个工作组).

4 local threads could be shared on two scalars and one 2-wide unit or any other combination of SIMDs. Kernel code can't know this. Even if it knows somehow computing things, you can't know which SIMD combination will be used for next kernel execution(or even next workgroup) at runtime in a random compute-unit(64 shaders).

或者其中具有4x16 SIMD的GCN CU可以为每个SIMD分配1个线程，从而使所有4个线程完全独立.如果他们都住在同一个SIMD中，那么您会很幸运.无法保证知道内核执行之前.即使您知道之后，下一个内核也可能会有所不同，因为不能保证选择相同的SIMD分配(后台内核，3d可视化软件，甚至OS都可能在管道中冒出气泡)

Or a GCN CU, which has 4x16 SIMDs in it, could allocate 1 thread per SIMD, making all 4 threads totally independent. If they all reside in same SIMD, youre lucky. There is no guarantee knowing that "before" kernel execution. Even after you know, next kernel could be different since there is no guarantee of choosing same SIMD allocation(background kernels, 3d visualization softwares, even OS could be putting bubbles in pipelines)

不能保证在内核执行之前命令/提示/查询N个线程以相同的SIMD或相同的WARP运行.然后在内核中，没有命令像get_global_id(0)一样获取线程的波前索引.然后在内核之后，您将不能依赖数组结果，因为下一次内核执行可能不会对完全相同的项目使用相同的SIMD.甚至可以将其他波阵面中的某些项目与当前波阵面中的项目交换，仅用于通过驱动程序或硬件进行优化(nvidia最近已使用loadbalancer，并且可能一直在这样做，而且amd的NCU将来可能会使用类似的东西)

There is no guarantee of commanding/hinting/querying of N threads to run as same SIMD or same WARP before kernel execution. Then in the kernel, there is no command to get a thread's wavefront index just like get_global_id(0). Then after kernel, you can't rely on array results since next kernel execution may not use same SIMDs for exact same items. Even some items from other wavefronts could be swapped with an item from current wavefront just for an optimization by driver or hardware (nvidia has loadbalancer lately and could have been doing this, also NCU of amd may use similar thing in future)

即使您猜测硬件和驱动程序上SIMD上线程的正确组合，在另一台计算机上也可能完全不同.

Even if you guess right combination of threads on SIMDs on your hardware and driver, it could be totally different in another computer.

如果从性能角度考虑，您可以尝试:

内核代码中的零分支
零内核在后台运行
gpu未用于监视器输出
gpu未用于某些可视化软件

只需确保％99的概率，流水线中就不会出现气泡，因此所有线程都在同一周期内退出一条指令(或至少同步至最先退出一条指令).

Just to make sure %99 probability, there are no bubbles in pipelines so all threads retire an instruction at the same cycle(or at least synchronize at latest retiring one).

或者，在每条指令之后添加一个篱笆以在非常慢的全局或本地存储器上进行同步.栅栏使工作项级别同步，栅栏使本地组级别同步.没有波前同步命令.

Or, add a fence after every instruction to synchronize on global or local memory which is very slow. Fences make workitem level synhronization, barriers make local group level synchronization. There are no wavefront synchronization commands.

然后，在同一SIMD中运行的那些线程将表现为同步.但是您可能不知道这些线程是哪些以及哪些SIMD.

Then, those threads that run within same SIMD will behave synchronized. But you may not know which threads those are and which SIMDs.

对于4线程示例，对所有计算使用float16可能会使驱动程序使用AMD GCN CU的16宽SIMD进行计算，但是它们现在不再是线程，而仅是变量.但这应该比线程具有更好的数据同步性.

For the 4-thread example, using float16 for all calculations may let the driver use 16-wide SIMDs of AMD GCN CU to compute but then they are not threads anymore, only variables. But this should have better synchronization on data, than threads.

还有更复杂的情况，例如:

There are more complex situations such as:

在同一SIMD中有4个线程，但是一个线程计算会生成一些NaN值，并对此进行额外的归一化(可能需要1-2个周期).其他3个应等待完成，但它独立于数据相关的速度下降而起作用.

4 threads in same SIMD but one thread calculation generates some NaN value and does an extra normalization(taking 1-2 cycle maybe) on that. 3 others should wait for completion but it works independently of data related slowdowns.

4个线程处于循环中，并且其中一个永远卡住.他们中的3个人等待第4个人永远完成，还是驱动程序检测到并将其移至另一个空SIMD?或第4个人同时也等着其他3个人，因为他们也没动！

4 threads in same wavefront are in a loop and one of them stuck forever. 3 of them wait for the 4th one to finish forever or driver detects and moves it to another free-empty SIMD? Or 4th one waits for other 3 at the same time because they are not moving too!

4个线程，一个接一个地进行原子操作.

4 threads doing atomic operations, one by one.

Amd的HD5000系列GPU的SIMD宽度为4(或5)，但波前尺寸为64.

Amd's HD5000 series gpu has SIMD width 4(or 5) but wavefront size is 64.

这篇关于是否可以保证WaveFront(OpenCL)中的所有线程始终同步?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！