Question

Shaders have invocations, each of which is (usually) given a unique set of input data and (usually) writes to its own separate output data. When you issue a rendering command, how many times does each shader get invoked?

Answer

Each shader stage has its own frequency of invocations. I will use the OpenGL terminology, but D3D works the same way (since they're both modelling the same hardware relationships).

Vertex Shaders

These are the second most complicated. They execute once for every input vertex... kinda. If you are using non-indexed rendering, then the ratio is exactly 1:1. Every input vertex will execute on a separate vertex shader instance.

If you are using indexed rendering, then it gets complicated. It's more-or-less 1:1, each vertex having its own VS invocation. However, thanks to post-T&L caching, it is possible for a vertex shader to be executed less than once per input vertex.

See, a vertex shader's execution is assumed to create a 1:1 mapping between input vertex data and output vertex data. This means if you pass identical input data to a vertex shader (in the same rendering command), your VS is expected to generate identical output data. So if the hardware can detect that it is about to execute a vertex shader on the same input data that it has used previously, it can skip that execution and simply use the outputs from the previous execution. Assuming it has those values lying around, such as in a cache.

Hardware detects this by using the vertex's index (which is why it doesn't work for non-indexed rendering). If the same index is provided to a vertex shader, it is assumed that the shader will get all of the same input values, and therefore will generate the same output values. So the hardware will cache output values based on indices. If an index is in the post-T&L cache, then the hardware will skip the VS's execution and just use the output values.

Instancing only slightly complicates post-T&L caching. Rather than caching solely on the vertex index, it caches based on the index and instance ID. So it only uses the cached data if both values are the same.
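To make the caching behaviour concrete, here is a rough model of it. This is only a sketch: the FIFO eviction policy, the `cache_size` of 16, and the function name `count_vs_invocations` are assumptions made for illustration, and real hardware caches differ in size and policy.

```python
from collections import OrderedDict


def count_vs_invocations(indices, instance_count=1, cache_size=16):
    """Count vertex shader runs for an indexed draw.

    Assumption: the post-T&L cache is a simple FIFO keyed on
    (instance ID, vertex index). Real hardware may behave differently.
    """
    cache = OrderedDict()  # keys only; the cached outputs are not modelled
    invocations = 0
    for instance_id in range(instance_count):
        for index in indices:
            key = (instance_id, index)
            if key in cache:
                continue  # cache hit: earlier outputs are reused, no VS run
            invocations += 1
            cache[key] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the oldest entry
    return invocations


# Two triangles forming a quad: 6 indices, but only 4 distinct vertices.
quad = [0, 1, 2, 2, 1, 3]
print(count_vs_invocations(quad))                    # 4
print(count_vs_invocations(quad, instance_count=3))  # 12
print(count_vs_invocations(quad, cache_size=1))      # 5
```

With a tiny cache, index 1 has already been evicted by the time it is reused, so the VS runs again. That is why the ordering of indices matters when optimizing geometry for the cache.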

So generally, a VS executes once for every vertex, but if you optimize your geometry with indexed data, it can execute fewer times. Sometimes far fewer, depending on how you do it.

Tessellation Control Shaders

Or Hull Shaders in D3D parlance.

The TCS is very simple in this regard. It will execute exactly once for each output vertex in each patch of the rendering command. No caching or other optimizations are done here.
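As a quick arithmetic sketch, the TCS count can be written down directly. The function and parameter names here are made up for illustration. It assumes the per-patch count is the output patch size declared in the TCS (`layout(vertices = N) out` in GLSL) and that a trailing incomplete patch is dropped.

```python
def tcs_invocations(draw_vertex_count, patch_vertices, output_vertices_per_patch):
    """Estimate TCS invocations for one draw call.

    Assumptions: `patch_vertices` is the input patch size, a trailing
    incomplete patch is ignored, and the TCS runs once per output vertex.
    """
    patch_count = draw_vertex_count // patch_vertices
    return patch_count * output_vertices_per_patch


# 48 vertices drawn as 16-vertex patches, with the TCS emitting 16 vertices each.
print(tcs_invocations(48, 16, 16))  # 48
# Same draw, but the TCS outputs only 4 vertices per patch.
print(tcs_invocations(48, 16, 4))   # 12
```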

Tessellation Evaluation Shaders

Or Domain Shaders in D3D parlance.

The TES executes after the tessellation primitive generator has generated new vertices. Because of that, how frequently it executes will obviously depend on your tessellation parameters.

The TES takes vertices generated by the tessellator and outputs vertices. It does so in a 1:1 ratio.

But similar to Vertex Shaders, it is not necessarily 1:1 for each vertex in each of the output primitives. Like a VS, the TES is assumed to provide a direct 1:1 mapping between locations in the tessellated primitives and output parameters. So if you invoke a TES multiple times with the same patch location, it is expected to output the same value.

As such, if generated primitives share vertices, the TES will often only be invoked once for such shared vertices. Unlike vertex shaders, you have no control over how much the hardware will utilize this. The best you can do is hope that the generation algorithm is smart enough to minimize how often it calls the TES.
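To get a feel for how much sharing can matter, consider a deliberately simplified model: a quad patch tessellated into a uniform n-by-n grid of cells, two triangles per cell. This is not the actual tessellation pattern from any specification (real patterns depend on the inner and outer levels and the spacing mode), and the function name is hypothetical. It only shows the gap between "once per unique vertex" and "once per triangle corner".

```python
def tes_invocation_bounds(n):
    """Return (best_case, worst_case) TES invocations for a uniform n x n grid.

    Assumption: a simplified quad tessellation with 2 triangles per cell.
    best_case  = every shared vertex is evaluated once.
    worst_case = every triangle corner is evaluated separately.
    """
    unique_vertices = (n + 1) ** 2
    triangle_corners = 3 * (2 * n * n)
    return unique_vertices, triangle_corners


print(tes_invocation_bounds(4))  # (25, 96)
```

Where a real implementation lands between those two numbers is up to the hardware.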

Geometry Shaders

A Geometry Shader will be invoked once for each point, line or triangle primitive, either directly given by the rendering command or generated by the tessellator. So if you render 6 vertices as unconnected lines, your GS will be invoked exactly 3 times.

Each GS invocation can generate zero or more primitives as output.

The GS can use instancing internally (in OpenGL 4.0 or Direct3D 11). This means that, for each primitive that reaches the GS, the GS will be invoked X times, where X is the number of GS instances. Each such invocation will get the same input primitive data (with a special input value used to distinguish between such instances). This is useful for more efficiently directing primitives to different layers of layered framebuffers.
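The GS arithmetic can be sketched like this. The mode strings and function names are invented for the example, only a handful of primitive modes are covered, and primitive restart and tessellator output are ignored.

```python
def primitive_count(mode, vertex_count):
    """Number of primitives assembled from a draw (no primitive restart)."""
    if mode == "points":
        return vertex_count
    if mode == "lines":
        return vertex_count // 2
    if mode == "line_strip":
        return max(vertex_count - 1, 0)
    if mode == "triangles":
        return vertex_count // 3
    if mode == "triangle_strip":
        return max(vertex_count - 2, 0)
    raise ValueError(f"unsupported mode: {mode}")


def gs_invocations(mode, vertex_count, gs_instances=1):
    """One GS invocation per primitive, multiplied by the GS instance count."""
    return primitive_count(mode, vertex_count) * gs_instances


print(gs_invocations("lines", 6))                  # 3
print(gs_invocations("lines", 6, gs_instances=4))  # 12
```

The second call corresponds to a GS that declares 4 instances, for example to route each primitive to 4 layers.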

Fragment Shaders

Or Pixel Shaders in D3D parlance. Even though they aren't pixels yet, may not become pixels, and they can be executed multiple times for the same pixel ;)

These are the most complicated with regard to invocation frequency. How often they execute depends on a lot of things.

FS's must be executed at least once for each pixel-sized area that a primitive rasterizes to. But they may be executed more than that.

In order to compute derivatives for texture functions, one FS invocation will often borrow values from one of its neighboring invocations. This is problematic if there is no such invocation, which happens when a neighbor falls outside of the boundary of the primitive being rasterized.

In such cases, there will still be a neighboring FS invocation. Even though it produces no actual data, it still exists and still does work. The good part is that these helper invocations don't hurt performance. They're basically using up shader resources that would have otherwise gone unused. Also, any attempt by such helper invocations to actually output data will be ignored by the system.

But they do still technically exist.
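Here is a small model of how helper invocations add up. It assumes the hardware shades fragments in aligned 2x2 pixel quads, which is a common arrangement but is not something the text above states, and the function name is hypothetical.

```python
def fs_invocations_with_helpers(covered_pixels):
    """Return (total_invocations, helper_invocations) for one primitive.

    Assumption: fragments are shaded in aligned 2x2 quads, and every
    pixel in a touched quad gets an invocation, covered or not.
    """
    covered = set(covered_pixels)
    quads = {(x // 2, y // 2) for x, y in covered}
    total = 4 * len(quads)
    helpers = total - len(covered)
    return total, helpers


# A primitive covering a single pixel still occupies a whole quad.
print(fs_invocations_with_helpers([(5, 3)]))                          # (4, 3)
# A primitive covering one aligned quad exactly needs no helpers.
print(fs_invocations_with_helpers([(0, 0), (1, 0), (0, 1), (1, 1)]))  # (4, 0)
```

Under this model, small or thin primitives have the highest share of helper invocations.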

A less transparent issue revolves around multisampling. See, multisampling implementations (particularly in OpenGL) are allowed to decide on their own how many FS invocations to issue. While there are ways to force multisampled rendering to create an FS invocation for every sample, there is no guarantee that implementations will execute the FS only once per covered pixel outside of these cases.

For example, if I recall correctly, if you create a multisample image with a high sample count on certain NVIDIA hardware (8 to 16 or something like that), then the hardware may decide to execute the FS multiple times. Not necessarily once per sample, but once for every 4 samples or so.

So how many FS invocations do you get? At least one for every pixel-sized area covered by the primitive being rasterized. Possibly more if you're doing multisampled rendering.

Compute Shaders

The exact number of invocations that you specify. That is, the number of work groups you dispatch * the number of invocations per group specified by your CS (its local size). No more, no less.
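Spelled out as arithmetic, with both the dispatch size and the local size given as three-component tuples (the function name is made up for this example):

```python
from math import prod


def cs_invocations(num_groups, local_size):
    """Total compute shader invocations for one dispatch.

    num_groups: (x, y, z) work group counts passed to the dispatch call.
    local_size: (x, y, z) invocations per group declared by the shader.
    """
    return prod(num_groups) * prod(local_size)


# Dispatching 4 x 2 x 1 groups of a shader with an 8 x 8 x 1 local size.
print(cs_invocations((4, 2, 1), (8, 8, 1)))  # 512
```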
