
Problem Description

When considering a conditional function call in a critical section of code I found that both gcc and clang will branch around the call. For example, for the following (admittedly trivial) code:

int32_t __attribute__((noinline)) negate(int32_t num) {
    return -num;
}

int32_t f(int32_t num) {
    int32_t x = num < 0 ? negate(num) : num;
    return 2*x + 1;
}

Both GCC and clang compile to essentially the following:

.global _f
_f:
    cmp     edi, 0
    jg      after_call
    call    _negate
after_call:
    lea     rax, [rax*2+1]
    ret

This got me thinking: what if x86 had a conditional call instruction like ARM? Imagine if there was such an instruction "ccallcc" with semantics like cmovcc. Then you could do something like:

.global _f
_f:
    cmp     edi, 0
    ccalll  _negate
    lea     rax, [rax*2+1]
    ret

Although we can't avoid the branch prediction, we do eliminate a branch. Namely, in the actual GCC/clang output, we are forced to branch regardless of whether num < 0 or not. And if num < 0 we have to branch twice. This seems wasteful.

Now such an instruction doesn't exist in amd64, but I devised a way to simulate such an instruction. I did this by breaking call func into its component parts: push rip (well technically [rip+label_after_call_instruction]) and then jmp func. We can make the jmp conditional, but there is no conditional push. We can simulate this by computing [rip+label_after_call_instruction] and writing it to the appropriate location on the stack, then conditionally updating rsp if we plan to call the function (which actually "pushes" the [rip+label_after_call_instruction]). It looks something like this:

.global _f
_f:
    cmp     edi, 0

    # ccalll _negate
    lea     rax, [rip+after_ccall]  # Compute return address
    mov     [rsp-8], rax            # Prepare to "push" return address
    lea     rax, [rsp-8]            # Compute rsp (after push)
    cmovl   rsp, rax                # Conditionally push (by actually changing rsp)
    jl      _negate                 # "Conditional call"
after_ccall:

    lea     rax, [rax*2+1]
    ret

There are a few potential downsides to this approach:

  • It introduces several instructions (but they total fewer cycles than the branch mispredict penalty)
  • It requires a write to memory (but the stack probably is cached?)
  • It always executes the 2 leas and mov even if the call isn't made (but my understanding is this doesn't matter as cmovcc takes the same number of cycles as mov, for example)

To examine the properties of each of these approaches, I ran the critical sections through iaca. If you have it installed (and you clone my benchmark gist below), you can run make iaca to see for yourself. Pass IACAFLAGS='-arch=...' to specify a different arch.
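
If you mark a region by hand instead of using my makefile, the markers iaca looks for are, as far as I understand iacaMarks.h (double-check against your copy of the header), just these two byte sequences around the block of interest:

    mov     ebx, 111                # IACA start marker
    .byte   0x64, 0x67, 0x90

    # ... the measured sequence, e.g. the cmp/cmovl/jl block above ...

    mov     ebx, 222                # IACA end marker
    .byte   0x64, 0x67, 0x90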

The output for the branch over approach:

Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File -  ./branch_over_call_iaca.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 0.82 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  36
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  0.5     0.0  |  0.0  |  0.3     0.0  |  0.3     0.0  |  1.0  |  0.0  |  0.5  |  0.3  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      | 0.5         |      |             |             |      |      | 0.5  |      | jnle 0x6
|   4^#    |             |      | 0.3         | 0.3         | 1.0  |      |      | 0.3  | call 0x5
Total Num Of Uops: 5

And the output for the conditional call approach:

Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-30;16:57:45
Analyzed File -  ./conditional_call_iaca.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.94 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  35
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  1.0  |  0.5     0.0  |  0.5     0.0  |  1.0  |  1.0  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             | 1.0  |             |             |      |      |      |      | lea rax, ptr [rip]
|   2^     |             |      | 0.5         | 0.5         | 1.0  |      |      |      | mov qword ptr [rsp-0x8], rax
|   1      |             |      |             |             |      | 1.0  |      |      | lea rax, ptr [rsp-0x8]
|   1      | 1.0         |      |             |             |      |      |      |      | cmovl rsp, rax
|   1      |             |      |             |             |      |      | 1.0  |      | jl 0x6
Total Num Of Uops: 6

It looks like the conditional call approach uses more of the hardware. But I found it interesting that the conditional approach only has 1 more uop (the branch over approach had 5 uops). I guess this makes sense given that under the hood the call turns into a push and jmp (and the push turns into rsp math and a memory mov). This would suggest to me that the conditional call approach is approximately equivalent (although maybe my simplistic analysis is flawed here?).

At the least, my overarching suspicion was that by introducing several instructions between the cmp and jl, the result of the cmp would be available before the jl is speculatively executed (thus avoiding the need for branch prediction entirely). Although maybe the pipeline is longer than this? This treads into areas with which (despite having read and retained a medium-depth understanding of Agner Fog's optimization manuals) I am not very familiar.

My hypothesis is that for a uniform distribution of (negative and positive) nums (where branch prediction won't be able to predict the branch around the call), my "conditional call" approach will outperform branching around the call.

I wrote a harness to benchmark the performance of these two approaches. You can git clone https://gist.github.com/baileyparker/8a13c22d0e26396921f501fe87f166a9 and run make to benchmark the two approaches on your machine.

Here's the runtime of 100 iterations of each approach on an array of 1,048,576 numbers (uniformly distributed between int32_t min and max).

|                    CPU                    | Conditional Call | Branch Over |
|-------------------------------------------|-----------------:|------------:|
| Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz |       10.9872 ms |   8.4602 ms |
| Intel(R) Xeon(R) CPU E3-1240 v6 @ 3.70GHz |        8.8132 ms |   7.0704 ms |

These results are consistent across runs and although magnified by increasing the array size (or number of iterations), branching over always wins.

I also tried reordering the conditional call steps (computing and conditionally updating rsp first, then writing to the stack), but this performed similarly.
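
For reference, the reordered variant looked roughly like this (reconstructed from memory, so the exact registers and instruction order may differ slightly from the gist; note that this version clobbers rcx in addition to rax):

    cmp     edi, 0

    # ccalll _negate (reordered: adjust rsp first, then write the return address)
    lea     rcx, [rsp-8]            # Compute rsp (after push)
    cmovl   rsp, rcx                # Conditionally "push" by moving rsp first
    lea     rax, [rip+after_ccall]  # Compute return address
    mov     [rcx], rax              # Store it (lands in the red zone if the call isn't taken)
    jl      _negate                 # "Conditional call"
after_ccall: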

What hardware detail am I missing (or misunderstanding) that explains this? From my calculations, the extra instructions add somewhere around 6-7 cycles, but a branch mispredict costs 15. So, with on average half the numbers predicted wrong, each iteration costs 15/2 cycles for the branch over approach and always 6-7 cycles for the conditional call. The uops from iaca suggest the approaches are even closer in this regard. So, shouldn't the performance be closer? Is my example code too contrived/short? Is my benchmarking technique not appropriate for this kind of low-level critical section testing? Is there a way to reorder/change the conditional call to make it more performant (better than or comparable to the branch over approach, maybe)?

tl;dr Why does my conditional call code (4th code snippet) perform worse than what gcc/clang produces (conditional jump over the call) (2nd code snippet) (for the code in the 1st snippet) on my benchmark?

Solution

You can exactly determine why the conditional_call approach is slower than branch_over_call. You've done your experiments on two KBL processors, but the blog post you were referred to doesn't discuss how the RAS works on KBL. So the first step of the analysis is to determine whether the ret in the negate function is being mispredicted or not (as would happen on earlier microarchitectures). The second step is to determine what the cost of mispredicting that ret instruction is in terms of total execution time. The closest thing that I have to KBL is CFL, and my numbers turned out to be close to yours. The only relevant difference between the two is that the LSD is enabled in CFL but disabled in KBL. However, the LSD is irrelevant in this case because the call instruction in the loop prevents the LSD from detecting any loop. You can also easily repeat the same analysis on KBL.

There are several ways to analyze the behavior of branch instructions. But in this particular case, the code is simple enough for the event counting method to reveal all the information that we need about every static branch instruction.

The BR_INST_RETIRED_* performance events can be used to count the total number of retired dynamic branch instructions and the total numbers of specific types of retired branch instructions, including conditionals, calls, and returns. The BR_MISP_RETIRED_* events can be used to count total mispredictions, total conditional mispredictions, and total call mispredictions.

The complete control-flow graph of conditional_call looks like this:

           total   misp
call         1      0
    jl       1     0.5
       ret  0.5     1
    ret      1      0
jne          1      0

The first call instruction calls the conditional_call function, which contains jl and ret. The jl instruction conditionally jumps to the negate function, which contains ret. The jne instruction is used for the loop. The numbers shown in the first and second columns are normalized by the total number of iterations and by the number of dynamic instances of the corresponding instruction, respectively (so the second column is a per-instruction misprediction rate). We know from the static structure of the program that call, jl, conditional_call's ret, and jne are each executed once in every iteration. The innermost ret is only executed when the jl branch is taken. Using the performance events, we can count the total number of executed return instructions and subtract from it the total number of iterations to get the number of times the innermost ret is executed. Because the input is randomized according to the uniform distribution, it shouldn't be surprising that the innermost ret is executed half of the time.
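
To make those rows concrete, the per-iteration structure being counted is roughly the following. This is just a sketch on my part; the register choices and loop bookkeeping are assumptions, not necessarily what your harness compiles to:

loop:
    mov     edi, dword ptr [rsi]    # next input element
    call    conditional_call        # the call counted once per iteration
                                    #   -> jl (taken half the time)
                                    #   ->   negate's ret (the innermost ret)
                                    #   -> conditional_call's own ret
    add     rsi, 4                  # advance to the next int32_t
    dec     ecx                     # remaining iterations
    jne     loop                    # the jne counted once per iteration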

The call instruction is never mispredicted. The jne instruction is also never mispredicted, except for its last execution (where it exits the loop). Therefore, we can attribute the total number of conditional mispredictions to the jl instruction. That can be subtracted from the total number of mispredictions to get the number of return mispredictions, which can be attributed to either or both of the return instructions. The second ret may mispredict when the misprediction of the first ret clobbers or misaligns the RAS. One way to determine whether the second ret is ever mispredicted is by using precise sampling of BR_MISP_RETIRED.ALL_BRANCHES. Another way is by using the method described in the blog post you cited. Indeed, only the innermost ret is mispredicted. The fact that jl is mispredicted half of the time suggests that the instruction is either being predicted always taken or always not taken.

The complete control-flow graph of branch_over_call looks like this:

           total   misp
call         1      0
    jg       1     0.5
    call    0.5     0
        ret 0.5     0
    ret      1      0
jne          1      0

The only instruction that is mispredicted is jg, which is mispredicted half of the time.

To measure the average cost of a single ret misprediction in the conditional_call approach, the ret instruction can be replaced with a lea/jmp sequence so that the BTB rather than the RAS is used for making predictions. With this change, the only instruction that is mispredicted is jl. The difference in execution time can be considered as an estimate for the total cost of ret mispredictions. On my CFL processor, this is about 11.3 cycles per ret misprediction. In addition, conditional_call has become about 3% faster than branch_over_call. Your numbers on KBL indicate that the average cost of a ret misprediction is about 13 cycles. I'm not sure what the reason for this difference is. It may not be microarchitectural. I've used gcc 7.3 but you used gcc 8, so perhaps there are some differences in the code or the alignments of different pieces of code that are causing the discrepancy between our results.
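
For concreteness, one lea/jmp sequence that has this effect is shown below. It is a sketch of the idea rather than necessarily the exact replacement used for the measurement; the rest of negate stays unchanged:

    # in negate, instead of:
    #     ret
    lea     rsp, [rsp+8]            # pop the return address without a ret
    jmp     qword ptr [rsp-8]       # jump to it through memory; an indirect jmp is predicted
                                    # by the BTB/indirect predictor, not the RAS (this relies
                                    # on the red zone not being clobbered in between)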

