Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says

Question

In the Intel Intrinsics Guide, vmulpd and vfmadd213pd have a latency of 5 cycles, and vaddpd has a latency of 3 cycles.

I wrote some test code, but all of the results are 1 cycle slower.

Here is my test code:

.CODE
test_latency PROC
    vxorpd  ymm0, ymm0, ymm0
    vxorpd  ymm1, ymm1, ymm1

loop_start:
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    sub     rcx, 4
    jg      loop_start

    ret
test_latency ENDP
END

#include <stdio.h>
#include <omp.h>
#include <stdint.h>
#include <windows.h>

extern "C" void test_latency(int64_t n);

int main()
{
    SetThreadAffinityMask(GetCurrentThread(), 1);   // Pin the thread to core 0 so it is not migrated between cores

    int64_t n = (int64_t)3e9;
    double start = omp_get_wtime();
    test_latency(n);
    double end = omp_get_wtime();
    double time = end - start;

    double freq = 3.3e9;    // My CPU frequency
    double latency = freq * time / n;
    printf("latency = %f\n", latency);
}

My CPU is a Core i5 4590, and I locked its frequency at 3.3GHz. The output is: latency = 6.102484.
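The conversion from wall-clock time to cycles per instruction used in main() can be written out on its own. This is a minimal sketch; the function names cycles_per_instruction and implied_seconds are hypothetical helpers, not part of the original test code, and the elapsed time is back-computed from the reported figure rather than measured.

```cpp
#include <cassert>
#include <cstdint>

// Average core cycles per chained instruction, as computed in the question:
// cycles = frequency * elapsed_time / instruction_count.
double cycles_per_instruction(double freq_hz, double seconds, std::int64_t n) {
    assert(freq_hz > 0.0 && seconds > 0.0 && n > 0);
    return freq_hz * seconds / static_cast<double>(n);
}

// Inverse of the above: the wall-clock time implied by a given
// cycles-per-instruction figure (hypothetical helper for illustration).
double implied_seconds(double cycles, double freq_hz, std::int64_t n) {
    assert(freq_hz > 0.0 && n > 0);
    return cycles * static_cast<double>(n) / freq_hz;
}
```

With n = 3e9 and a 3.3GHz clock, a reported latency of 6.102484 corresponds to roughly 5.55 seconds of run time, whereas a 5-cycle latency would take roughly 4.55 seconds, so the extra cycle is far larger than timer noise.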

Strangely enough, if I change vmulpd ymm0, ymm0, ymm1 to vmulpd ymm0, ymm0, ymm0, the output becomes: latency = 5.093745.

Is there an explanation? Is there a problem with my test code?
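For readers who want to cross-check the measurement technique without MASM, OpenMP, or windows.h, here is a portable sketch of the same idea: time a chain of dependent multiplications and divide by the count. It is only an illustration under stated assumptions. The function name seconds_per_dependent_multiply is hypothetical, and the compiler chooses the instructions, so this exercises a scalar double multiply rather than the exact ymm instructions in the question.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Times a chain of n dependent double multiplications and returns the
// average seconds per multiplication. n must be a positive multiple of 4.
double seconds_per_dependent_multiply(std::int64_t n) {
    assert(n > 0 && n % 4 == 0);
    volatile double one = 1.0;   // volatile reads stop the compiler from folding the chain
    volatile double sink = 0.0;  // volatile write keeps the result observable

    const auto start = std::chrono::steady_clock::now();
    double x = one;
    const double y = one;
    for (std::int64_t i = 0; i < n; i += 4) {
        // Each multiply consumes the previous result, so the loop is
        // bound by multiply latency rather than throughput.
        x *= y;
        x *= y;
        x *= y;
        x *= y;
    }
    sink = x;
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count()
           / static_cast<double>(n);
}
```

Multiplying the returned value by the core frequency gives cycles per multiply. As in the original test, that is only meaningful when the frequency is locked, and unoptimized builds add loop overhead to the chain.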

More results

results on Core i5 4590 @3.3GHz
vmulpd  ymm0, ymm0, ymm1       6.056094
vmulpd  ymm0, ymm0, ymm0       5.054515
vaddpd  ymm0, ymm0, ymm1       4.038062
vaddpd  ymm0, ymm0, ymm0       3.029360
vfmadd213pd ymm0, ymm0, ymm1   6.052501
vfmadd213pd ymm0, ymm1, ymm0   6.053163
vfmadd213pd ymm0, ymm1, ymm1   6.055160
vfmadd213pd ymm0, ymm0, ymm0   5.041532

(without vzeroupper)
vmulpd  xmm0, xmm0, xmm1       6.050404
vmulpd  xmm0, xmm0, xmm0       5.042191
vaddpd  xmm0, xmm0, xmm1       4.044518
vaddpd  xmm0, xmm0, xmm0       3.024233
vfmadd213pd xmm0, xmm0, xmm1   6.047219
vfmadd213pd xmm0, xmm1, xmm0   6.046022
vfmadd213pd xmm0, xmm1, xmm1   6.052805
vfmadd213pd xmm0, xmm0, xmm0   5.046843

(with vzeroupper)
vmulpd  xmm0, xmm0, xmm1       5.062350
vmulpd  xmm0, xmm0, xmm0       5.039132
vaddpd  xmm0, xmm0, xmm1       3.019815
vaddpd  xmm0, xmm0, xmm0       3.026791
vfmadd213pd xmm0, xmm0, xmm1   5.043748
vfmadd213pd xmm0, xmm1, xmm0   5.051424
vfmadd213pd xmm0, xmm1, xmm1   5.049090
vfmadd213pd xmm0, xmm0, xmm0   5.051947

(without vzeroupper)
mulpd   xmm0, xmm1             5.047671
mulpd   xmm0, xmm0             5.042176
addpd   xmm0, xmm1             3.019492
addpd   xmm0, xmm0             3.028642

(with vzeroupper)
mulpd   xmm0, xmm1             5.046220
mulpd   xmm0, xmm0             5.057278
addpd   xmm0, xmm1             3.025577
addpd   xmm0, xmm0             3.031238
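To make the pattern in these tables explicit, the sketch below rounds each measurement to whole cycles and subtracts the documented latency cited in the question (5 for vmulpd and vfmadd213pd, 3 for vaddpd). Only six ymm rows are copied in; the names Row, kRows, extra_cycles, and count_rows_with_extra_cycle are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Whole cycles by which a measurement exceeds the documented latency.
long extra_cycles(double measured_cycles, int documented_latency) {
    assert(documented_latency > 0);
    return std::lround(measured_cycles) - documented_latency;
}

struct Row {
    const char* instruction;
    double measured;   // cycles, copied from the ymm table
    int documented;    // latency cited in the question
};

static const Row kRows[] = {
    {"vmulpd ymm0, ymm0, ymm1",      6.056094, 5},
    {"vmulpd ymm0, ymm0, ymm0",      5.054515, 5},
    {"vaddpd ymm0, ymm0, ymm1",      4.038062, 3},
    {"vaddpd ymm0, ymm0, ymm0",      3.029360, 3},
    {"vfmadd213pd ymm0, ymm0, ymm1", 6.052501, 5},
    {"vfmadd213pd ymm0, ymm0, ymm0", 5.041532, 5},
};

// Counts the rows that come out exactly one cycle above the documented value.
int count_rows_with_extra_cycle() {
    int count = 0;
    for (const Row& row : kRows) {
        if (extra_cycles(row.measured, row.documented) == 1) {
            ++count;
        }
    }
    return count;
}
```

In these six rows, every case that reads ymm1 is one cycle over the documented latency, and every case that reads only ymm0 matches it.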

My guess

I changed test_latency like this:

.CODE
test_latency PROC
    vxorpd  ymm0, ymm0, ymm0
    vxorpd  ymm1, ymm1, ymm1

loop_start:
    vaddpd  ymm1, ymm1, ymm1  ; added this line
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    vmulpd  ymm0, ymm0, ymm1
    sub     rcx, 4
    jg      loop_start

    ret
test_latency ENDP
END

With that change I finally get a result of 5 cycles. Other instructions have the same effect:

vmovupd     ymm1, ymm0
vmovupd     ymm1, [mem]
vmovdqu     ymm1, [mem]
vxorpd      ymm1, ymm1, ymm1
vpxor       ymm1, ymm1, ymm1
vmulpd      ymm1, ymm1, ymm1
vshufpd     ymm1, ymm1, ymm1, 0

But these instructions do not:

vmovupd     ymm1, ymm2  ; suppose ymm2 is zeroed
vpaddq      ymm1, ymm1, ymm1
vpmulld     ymm1, ymm1, ymm1
vpand       ymm1, ymm1, ymm1

In the case of ymm instructions, I guess the conditions for avoiding the 1 extra cycle are:

All inputs come from the same domain.
All inputs are fresh enough. (A move from an old value doesn't work.)

As for VEX xmm, the condition is a little blurry. It seems related to the state of the upper half, but I don't know which of these is cleaner:

vxorpd      ymm1, ymm1, ymm1
vxorpd      xmm1, xmm1, xmm1
vzeroupper

This is a tough question for me.

Recommended answer

I've been meaning to write something up about this for a few years now, since noticing it on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Intel-Performance-Quirks#after-an-integer-to-fp-bypass-latency-can-be-increased-indefinitely

Bypass-delay latency is "sticky": an integer SIMD instruction can "infect" all future instructions that read that value, even long after the instruction has finished. I'm surprised that the "infection" survived across a zeroing idiom, especially an FP zeroing instruction like vxorpd, but I can reproduce that effect on SKL (i7-6700k, counting clock cycles directly in a test loop with perf on Linux instead of messing around with time and frequency).

(On Skylake, 3 or more vxorpd zeroing instructions in a row before the loop happen to work, removing the extra bypass latency. AFAIK, xor-zeroing is always eliminated, unlike mov-elimination, which sometimes fails. But perhaps the difference is just that it creates a gap between the issue of the vpaddb into the back-end and the first vmulpd; in my test loop I "dirty" / pollute the register right before the loop.)

Presumably some previous use of YMM1 in the caller involved an integer instruction. (TODO: investigate how common it is for a register to get into this state, and when it can survive xor-zeroing! I expected it to happen only when constructing an FP bit-pattern with integer instructions, including stuff like vpcmpeqd ymm1, ymm1, ymm1 to make a -NaN (all-one bits).)

On Skylake I can fix it by doing vaddpd ymm1, ymm1, ymm1 before the loop, after the xor-zeroing. (Or before; it might not matter! That might be more optimal, putting it at the end of the previous dep chain instead of the start of this one.)

As I wrote
