How exactly do partial registers perform on Haswell/Skylake? Writing AL seems to have a false dependency on RAX, and AH is inconsistent

Problem description


This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on imul throughput as expected. But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because setnz al has a dependency on the last imul.

; synthetic micro-benchmark to test partial-register renaming
    mov     ecx, 1000000000
.loop:                 ; do{
    imul    eax, eax     ; a dep chain with high latency but also high throughput
    imul    eax, eax
    imul    eax, eax

    dec     ecx          ; set ZF, independent of old ZF.  (Use sub ecx,1 on Silvermont/KNL or P4)
    setnz   al           ; ****** Does this depend on RAX as well as ZF?
    movzx   eax, al
    jnz  .loop         ; }while(ecx);

If setnz al depends on rax, the 3x imul/setcc/movzx sequence forms a loop-carried dependency chain. If not, each setcc/movzx/3x imul chain is independent, forked off from the dec that updates the loop counter. The 11c per iteration measured on HSW/SKL is perfectly explained by a latency bottleneck: 3x 3c (imul) + 1c (read-modify-write by setcc) + 1c (movzx within the same register).


Off topic: avoiding these (intentional) bottlenecks

I was going for understandable / predictable behaviour to isolate partial-reg stuff, not optimal performance.

For example, xor-zero / set-flags / setcc is better anyway (in this case, xor eax,eax / dec ecx / setnz al). That breaks the dep on eax on all CPUs (except early P6-family like PII and PIII), still avoids partial-register merging penalties, and saves 1c of movzx latency. It also uses one fewer ALU uop on CPUs that handle xor-zeroing in the register-rename stage. See that link for more about using xor-zeroing with setcc.
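
As a sketch, the loop body rewritten that way (NASM, same harness as the original; this is the version measured at 3.0c per iteration below):

; xor-zero / set-flags / setcc version of the benchmark loop (sketch)
    mov     ecx, 1000000000
.loop:                 ; do{
    imul    eax, eax     ; EAX starts as 0 or 1 from the previous setnz
    imul    eax, eax
    imul    eax, eax

    xor     eax, eax     ; discard the imul result: handled at rename, no ALU uop
    dec     ecx          ; set ZF
    setnz   al           ; writes the low byte of an already-zeroed EAX: no movzx needed
    jnz  .loop           ; }while(ecx);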

Note that AMD, Intel Silvermont/KNL, and P4, don't do partial-register renaming at all. It's only a feature in Intel P6-family CPUs and its descendant, Intel Sandybridge-family, but seems to be getting phased out.

gcc unfortunately does tend to use cmp / setcc al / movzx eax,al where it could have used xor instead of movzx (Godbolt compiler-explorer example), while clang uses xor-zero/cmp/setcc unless you combine multiple boolean conditions like count += (a==b) | (a==~b).

The xor/dec/setnz version runs at 3.0c per iteration on Skylake, Haswell, and Core2 (bottlenecked on imul throughput). xor-zeroing breaks the dependency on the old value of eax on all out-of-order CPUs other than PPro/PII/PIII/early-Pentium-M (where it still avoids partial-register merging penalties but doesn't break the dep). Agner Fog's microarch guide describes this. Replacing the xor-zeroing with mov eax,0 slows it down to one per 4.78 cycles on Core2: 2-3c stall (in the front-end?) to insert a partial-reg merging uop when imul reads eax after setnz al.

Also, I used movzx eax, al which defeats mov-elimination, just like mov rax,rax does. (IvB, HSW, and SKL can rename movzx eax, bl with 0 latency, but Core2 can't). This makes everything equal across Core2 / SKL, except for the partial-register behaviour.


The Core2 behaviour is consistent with Agner Fog's microarch guide, but the HSW/SKL behaviour isn't consistent with what he writes in section 11.10 for Skylake (the same description as for previous Intel uarches).

He unfortunately doesn't have time to do detailed testing for every new uarch to re-test assumptions, so this change in behaviour slipped through the cracks.

Agner does describe a merging uop being inserted (without stalling) for high8 registers (AH/BH/CH/DH) on Sandybridge through Skylake, and for low8/low16 on SnB. (I've unfortunately been spreading mis-information in the past, and saying that Haswell can merge AH for free. I skimmed Agner's Haswell section too quickly, and didn't notice the later paragraph about high8 registers. Let me know if you see my wrong comments on other posts, so I can delete them or add a correction. I will try to at least find and edit my answers where I've said this.)


My actual questions: How exactly do partial registers really behave on Skylake?

Is everything the same from IvyBridge to Skylake, including the high8 extra latency?

Intel's optimization manual is not specific about which CPUs have false dependencies for what (although it does mention that some CPUs have them), and leaves out things like reading AH/BH/CH/DH (high8 registers) adding extra latency even when they haven't been modified.

If there's any P6-family (Core2/Nehalem) behaviour that Agner Fog's microarch guide doesn't describe, that would be interesting too, but I should probably limit the scope of this question to just Skylake or Sandybridge-family.


My Skylake test data, from putting %rep 4 short sequences inside a small dec ebp/jnz loop that runs 100M or 1G iterations. I measured cycles with Linux perf the same way as in my answer here, on the same hardware (desktop Skylake i7 6700k).
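
Concretely, the harness was shaped like this (a sketch; mov ah, bh stands in for whichever short sequence was under test):

global _start
_start:
    mov     ebp, 100000000   ; 100M iterations (1G for the longer runs)
.loop:
%rep 4
    mov     ah, bh           ; <-- short sequence under test goes here
%endrep
    dec     ebp
    jnz     .loop

    xor     edi, edi
    mov     eax, 231         ; __NR_exit_group from /usr/include/asm/unistd_64.h
    syscall                  ; sys_exit_group(0)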

Unless otherwise noted, each instruction runs as 1 fused-domain uop, using an ALU execution port. (Measured with ocperf.py stat -e ...,uops_issued.any,uops_executed.thread). This detects (absence of) mov-elimination and extra merging uops.

The "4 per cycle" cases are an extrapolation to the infinitely-unrolled case. Loop overhead takes up some of the front-end bandwidth, but anything better than 1 per cycle is an indication that register-renaming avoided the write-after-write output dependency, and that the uop isn't handled internally as a read-modify-write.

Writing to AH only: prevents the loop from executing from the loopback buffer (aka the Loop Stream Detector (LSD)). Counts for lsd.uops are exactly 0 on HSW, and tiny on SKL (around 1.8k) and don't scale with the loop iteration count. Probably those counts are from some kernel code. When loops do run from the LSD, lsd.uops ~= uops_issued to within measurement noise. Some loops alternate between LSD or no-LSD (e.g. when they might not fit into the uop cache if decode starts in the wrong place), but I didn't run into that while testing this.

  • repeated mov ah, bh and/or mov ah, bl runs at 4 per cycle. It takes an ALU uop, so it's not eliminated like mov eax, ebx is.
  • repeated mov ah, [rsi] runs at 2 per cycle (load throughput bottleneck).
  • repeated mov ah, 123 runs at 1 per cycle. (A dep-breaking xor eax,eax inside the loop removes the bottleneck.)
  • repeated setz ah or setc ah runs at 1 per cycle. (A dep-breaking xor eax,eax lets it bottleneck on p06 throughput for setcc and the loop branch.)

    Why does writing ah with an instruction that would normally use an ALU execution unit have a false dependency on the old value, while mov r8, r/m8 doesn't (for reg or memory src)? (And what about mov r/m8, r8? Surely it doesn't matter which of the two opcodes you use for reg-reg moves?)

  • repeated add ah, 123 runs at 1 per cycle, as expected.

  • repeated add dh, cl runs at 1 per cycle.
  • repeated add dh, dh runs at 1 per cycle.
  • repeated add dh, ch runs at 0.5 per cycle. Reading [ABCD]H is special when they're "clean" (in this case, RCX is not recently modified at all).

Terminology: All of these leave AH (or DH) "dirty", i.e. in need of merging (with a merging uop) when the rest of the register is read (or in some other cases). i.e. that AH is renamed separately from RAX, if I'm understanding this correctly. "clean" is the opposite. There are many ways to clean a dirty register, the simplest being inc eax or mov eax, esi.
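
For example, a hypothetical sequence just to illustrate the terminology:

    mov     ah, bl        ; AH is now "dirty": renamed separately from RAX
    add     ecx, eax      ; reading the full register triggers a merging uop
    mov     eax, esi      ; write-only to the full register: AH is "clean" again
    add     ecx, eax      ; no merging uop this time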

Writing to AL only: These loops do run from the LSD: uops_issued.any ~= lsd.uops.

  • repeated mov al, bl runs at 1 per cycle. An occasional dep-breaking xor eax,eax per group lets OOO execution bottleneck on uop throughput, not latency.
  • repeated mov al, [rsi] runs at 1 per cycle, as a micro-fused ALU+load uop. (uops_issued=4G + loop overhead, uops_executed=8G + loop overhead). A dep-breaking xor eax,eax before a group of 4 lets it bottleneck on 2 loads per clock.
  • repeated mov al, 123 runs at 1 per cycle.
  • repeated mov al, bh runs at 0.5 per cycle. (1 per 2 cycles). Reading [ABCD]H is special.
  • xor eax,eax + 6x mov al,bh + dec ebp/jnz: 2c per iter, bottleneck on 4 uops per clock for the front-end.
  • repeated add dl, ch runs at 0.5 per cycle. (1 per 2 cycles). Reading [ABCD]H apparently creates extra latency for dl.
  • repeated add dl, cl runs at 1 per cycle.

I think a write to a low-8 reg behaves as a RMW blend into the full reg, like add eax, 123 would be, but it doesn't trigger a merge if ah is dirty. So (other than ignoring AH merging) it behaves the same as on CPUs that don't do partial-reg renaming at all. It seems AL is never renamed separately from RAX?

  • inc al/inc ah pairs can run in parallel.
  • mov ecx, eax inserts a merging uop if ah is "dirty", but the actual mov is renamed. This is what Agner Fog describes for IvyBridge and later.
  • repeated movzx eax, ah runs at one per 2 cycles. (Reading high-8 registers after writing full regs has extra latency.)
  • movzx ecx, al has zero latency and doesn't take an execution port on HSW and SKL. (Like what Agner Fog describes for IvyBridge, but he says HSW doesn't rename movzx).
  • movzx ecx, cl has 1c latency and takes an execution port. (mov-elimination never works for the same,same case, only between different architectural registers.)

    A loop that inserts a merging uop every iteration can't run from the LSD (loop buffer)?

I don't think there's anything special about AL/AH/RAX vs. B*, C*, DL/DH/RDX. I have tested some with partial regs in other registers (even though I'm mostly showing AL/AH for consistency), and have never noticed any difference.

How can we explain all of these observations with a sensible model of how the microarch works internally?


Related: Partial flag issues are different from partial register issues. See INC instruction vs ADD 1: Does it matter? for some super-weird stuff with shr r32,cl (and even shr r32,2 on Core2/Nehalem: don't read flags from a shift other than by 1).

See also Problems with ADC/SBB and INC/DEC in tight loops on some CPUs for partial-flag stuff in adc loops.

Solution

Other answers welcome to address Sandybridge and IvyBridge in more detail. I don't have access to that hardware.


I haven't found any partial-reg behaviour differences between HSW and SKL. On Haswell and Skylake, everything I've tested so far supports this model:

AL is never renamed separately from RAX (or r15b from r15). So if you never touch the high8 registers (AH/BH/CH/DH), everything behaves exactly like on a CPU with no partial-reg renaming (e.g. AMD).

Write-only access to AL merges into RAX, with a dependency on RAX. For loads into AL, this is a micro-fused ALU+load uop that executes on p0156, which is one of the strongest pieces of evidence that it's truly merging on every write, and not just doing some fancy double-bookkeeping as Agner speculated.

Agner (and Intel) say Sandybridge can require a merging uop for AL, so it probably is renamed separately from RAX. For SnB, see Intel's optimization manual, section 3.5.2.4 (Partial Register Stalls).

I think they're saying that on SnB, add al,bl will RMW the full RAX instead of renaming it separately, because one of the source registers is (part of) RAX. My guess is that this doesn't apply for a load like mov al, [rbx + rax]; rax in an addressing mode probably doesn't count as a source.
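
In other words, hypothetically (this is the guess, not a measured result):

    add     al, bl           ; a source is (part of) RAX: treated as an RMW of the full RAX on SnB
    mov     al, [rbx + rax]  ; RAX appears only in the addressing mode: the guess is AL can still rename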

I haven't tested whether high8 merging uops still have to issue/rename on their own on HSW/SKL. That would make the front-end impact equivalent to 4 uops (since that's the issue/rename pipeline width).

  • There is no way to break a dependency involving AL without writing EAX/RAX. xor al,al doesn't help, and neither does mov al, 0.
  • movzx ebx, al has zero latency (renamed), and needs no execution unit. (i.e. mov-elimination works on HSW and SKL). It triggers merging of AH if it's dirty, which I guess is necessary for it to work without an ALU. It's probably not a coincidence that Intel dropped low8 renaming in the same uarch that introduced mov-elimination. (Agner Fog's micro-arch guide has a mistake here, saying that zero-extended moves are not eliminated on HSW or SKL, only IvB.)
  • movzx eax, al is not eliminated at rename. mov-elimination on Intel never works for same,same. mov rax,rax isn't eliminated either, even though it doesn't have to zero-extend anything. (Although there'd be no point to giving it special hardware support, because it's just a no-op, unlike mov eax,eax). Anyway, prefer moving between two separate architectural registers when zero-extending, whether it's with a 32-bit mov or an 8-bit movzx.
  • movzx eax, bx is not eliminated at rename on HSW or SKL. It has 1c latency and uses an ALU uop. Intel's optimization manual only mentions zero-latency for 8-bit movzx (and points out that movzx r32, high8 is never renamed).
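
Collecting those rename rules into one sketch:

    movzx   ebx, al      ; eliminated at rename on HSW/SKL: 0c latency, no execution unit
    movzx   eax, al      ; same,same: not eliminated (1c latency, ALU uop)
    mov     rax, rax     ; not eliminated either, even though it's a pure no-op
    movzx   eax, bx      ; 16-bit source: not eliminated, 1c latency + ALU uop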

High-8 regs can be renamed separately from the rest of the register, and do need merging uops.

  • Write-only access to ah with mov ah, reg8 or mov ah, [mem8] renames AH, with no dependency on the old value. These are both instructions that wouldn't normally need an ALU uop for the 32-bit version. (But mov ah, bl is not eliminated; it does need a p0156 ALU uop, so that might be a coincidence).
  • a RMW of AH (like inc ah) dirties it.
  • setcc ah depends on the old ah, but still dirties it. I think mov ah, imm8 is the same, but haven't tested as many corner cases.

    (Unexplained: a loop involving setcc ah can sometimes run from the LSD, see the rcr loop at the end of this post. Maybe as long as ah is clean at the end of the loop, it can use the LSD?).

    If ah is dirty, setcc ah merges into the renamed ah, rather than forcing a merge into rax. e.g. %rep 4 (inc al / test ebx,ebx / setcc ah / inc al / inc ah) generates no merging uops, and only runs in about 8.7c (latency of 8 inc al slowed down by resource conflicts from the uops for ah. Also the inc ah / setcc ah dep chain).

    I think what's going on here is that setcc r8 is always implemented as a read-modify-write. Intel probably decided that it wasn't worth having a write-only setcc uop to optimize the setcc ah case, since it's very rare for compiler-generated code to setcc ah. (But see the godbolt link in the question: clang4.0 with -m32 will do so.)

  • reading AX, EAX, or RAX triggers a merge uop (which takes up front-end issue/rename bandwidth). Probably the RAT (Register Allocation Table) tracks the high-8-dirty state for the architectural R[ABCD]X, and even after a write to AH retires, the AH data is stored in a separate physical register from RAX. Even with 256 NOPs between writing AH and reading EAX, there is an extra merging uop. (ROB size=224 on SKL, so this guarantees that the mov ah, 123 was retired). Detected with uops_issued/executed perf counters, which clearly show the difference.

  • Read-modify-write of AL (e.g. inc al) merges for free, as part of the ALU uop. (Only tested with a few simple uops, like add/inc, not div r8 or mul r8). Again, no merging uop is triggered even if AH is dirty.

  • Write-only to EAX/RAX (like lea eax, [rsi + rcx] or xor eax,eax) clears the AH-dirty state (no merging uop).

  • Write-only to AX (mov ax, 1) triggers a merge of AH first. I guess instead of special-casing this, it runs like any other RMW of AX/RAX. (TODO: test mov ax, bx, although that shouldn't be special because it's not renamed.)
  • xor ah,ah has 1c latency, is not dep-breaking, and still needs an execution port.
  • Read and/or write of AL does not force a merge, so AH can stay dirty (and be used independently in a separate dep chain). (e.g. add ah, cl / add al, dl can run at 1 per clock, bottlenecked on add latency; see the sketch just below.)
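
A sketch of that last point, with AH and AL (i.e. RAX) carrying independent dep chains in the same architectural register:

.loop:
    add     ah, cl       ; AH chain: runs independently while AH stays dirty
    add     al, dl       ; RAX chain: touching AL doesn't force a merge
    dec     ebp
    jnz     .loop        ; both adds sustain 1 per clock (add-latency bound)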

Making AH dirty prevents a loop from running from the LSD (the loop-buffer), even when there are no merging uops. The LSD recycles uops in the queue that feeds the issue/rename stage (the IDQ).

Inserting merging uops is a bit like inserting stack-sync uops for the stack-engine. Intel's optimization manual says that SnB's LSD can't run loops with mismatched push/pop, which makes sense, but it implies that it can run loops with balanced push/pop. That's not what I'm seeing on SKL: even balanced push/pop prevents running from the LSD (e.g. push rax / pop rdx / times 6 imul rax, rdx; see the sketch below). (There may be a real difference between SnB's LSD and HSW/SKL's: SnB may just "lock down" the uops in the IDQ instead of repeating them multiple times, so a 5-uop loop takes 2 cycles to issue instead of 1.25.) Anyway, it appears that HSW/SKL can't use the LSD when a high-8 register is dirty, or when the loop contains stack-engine uops.
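
The balanced push/pop test loop was shaped like this (sketch):

.loop:
    push    rax
    pop     rdx
    times 6 imul rax, rdx   ; plenty of work per iteration, but still no LSD on SKL
    dec     ebp
    jnz     .loop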

This behaviour may be related to erratum SKL150 in SKL: "Short loops which use AH/BH/CH/DH registers may cause unpredictable system behavior."

This may also be related to Intel's optimization manual statement that SnB at least has to issue/rename an AH-merge uop in a cycle by itself. That's a weird difference for the front-end.

My Linux kernel log says microcode: sig=0x506e3, pf=0x2, revision=0x84. Arch Linux's intel-ucode package just provides the update; you have to edit config files to actually have it loaded. So my Skylake testing was on an i7-6700k with microcode revision 0x84, which doesn't include the fix for SKL150. It matches the Haswell behaviour in every case I tested, IIRC. (e.g. both Haswell and my SKL can run the setne ah / add ah,ah / rcr ebx,1 / mov eax,ebx loop from the LSD). I have HT enabled (which is a pre-condition for SKL150 to manifest), but I was testing on a mostly-idle system so my thread had the core to itself.

With updated microcode, the LSD is completely disabled all the time, not just when partial registers are active. lsd.uops is always exactly zero, including for real programs, not just synthetic loops. Hardware bugs (rather than microcode bugs) often require disabling a whole feature to fix. This is why SKL-avx512 (SKX) is reported to not have a loopback buffer. Fortunately this is not a performance problem: SKL's increased uop-cache throughput over Broadwell can almost always keep up with issue/rename.


Extra AH/BH/CH/DH latency:

  • Reading AH when it's not dirty (renamed separately) adds an extra cycle of latency for both operands. e.g. add bl, ah has a latency of 2c from input BL to output BL, so it can add latency to the critical path even if RAX and AH are not part of it. (I've seen this kind of extra latency for the other operand before, with vector latency on Skylake, where an int/float delay "pollutes" a register forever. TODO: write that up.)

This means unpacking bytes with movzx ecx, al / movzx edx, ah has extra latency vs. movzx/shr eax,8/movzx, but still better throughput.
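
Spelled out, the two unpacking sequences being compared (sketch):

; 2 uops, better throughput, but reading clean AH adds a latency cycle:
    movzx   ecx, al
    movzx   edx, ah
; 3 uops, lower latency into EDX's dep chain, no high8 read:
    movzx   ecx, al
    shr     eax, 8
    movzx   edx, al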

  • Reading AH when it is dirty doesn't add any latency. (add ah,ah or add ah,dh/add dh,ah have 1c latency per add). I haven't done a lot of testing to confirm this in many corner-cases.

    Hypothesis: a dirty high8 value is stored in the bottom of a physical register. Reading a clean high8 requires a shift to extract bits [15:8], but reading a dirty high8 can just take bits [7:0] of a physical register like a normal 8-bit register read.

Extra latency doesn't mean reduced throughput. This program can run at 1 iter per 2 clocks, even though all the add instructions have 2c latency (from reading DH, which is not modified.)

global _start
_start:
    mov     ebp, 100000000
.loop:
    add ah, dh
    add bh, dh
    add ch, dh
    add al, dh
    add bl, dh
    add cl, dh
    add dl, dh

    dec ebp
    jnz .loop

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)
 Performance counter stats for './testloop':

     48.943652      task-clock (msec)         #    0.997 CPUs utilized          
             1      context-switches          #    0.020 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
             3      page-faults               #    0.061 K/sec                  
   200,314,806      cycles                    #    4.093 GHz                    
   100,024,930      branches                  # 2043.675 M/sec                  
   900,136,527      instructions              #    4.49  insn per cycle         
   800,219,617      uops_issued_any           # 16349.814 M/sec                 
   800,219,014      uops_executed_thread      # 16349.802 M/sec                 
         1,903      lsd_uops                  #    0.039 M/sec                  

   0.049107358 seconds time elapsed


Some interesting test loop bodies:

%if 1
     imul eax,eax
     mov  dh, al
     inc dh
     inc dh
     inc dh
;     add al, dl
    mov cl,dl
    movzx eax,cl
%endif

Runs at ~2.35c per iteration on both HSW and SKL. Reading dl has no dep on the inc dh result. But using movzx eax, dl instead of mov cl,dl / movzx eax,cl causes a partial-register merge, and creates a loop-carried dep chain (8c per iteration).


%if 1
    imul  eax, eax
    imul  eax, eax
    imul  eax, eax
    imul  eax, eax
    imul  eax, eax         ; off the critical path unless there's a false dep

  %if 1
    test  ebx, ebx          ; independent of the imul results
    ;mov   ah, 123         ; dependent on RAX
    ;mov  eax,0           ; breaks the RAX dependency
    setz  ah              ; dependent on RAX
  %else
    mov   ah, bl          ; dep-breaking
  %endif

    add   ah, ah
    ;; ;inc   eax
;    sbb   eax,eax

    rcr   ebx, 1      ; dep on  add ah,ah  via CF
    mov   eax,ebx     ; clear AH-dirty

    ;; mov   [rdi], ah
    ;; movzx eax, byte [rdi]   ; clear AH-dirty, and remove dep on old value of RAX
    ;; add   ebx, eax          ; make the dep chain through AH loop-carried
%endif

The setcc version (with the %if 1) has 20c loop-carried latency, and runs from the LSD even though it has setcc ah and add ah,ah.

00000000004000e0 <_start.loop>:
  4000e0:       0f af c0                imul   eax,eax
  4000e3:       0f af c0                imul   eax,eax
  4000e6:       0f af c0                imul   eax,eax
  4000e9:       0f af c0                imul   eax,eax
  4000ec:       0f af c0                imul   eax,eax
  4000ef:       85 db                   test   ebx,ebx
  4000f1:       0f 94 d4                sete   ah
  4000f4:       00 e4                   add    ah,ah
  4000f6:       d1 db                   rcr    ebx,1
  4000f8:       89 d8                   mov    eax,ebx
  4000fa:       ff cd                   dec    ebp
  4000fc:       75 e2                   jne    4000e0 <_start.loop>

 Performance counter stats for './testloop' (4 runs):

       4565.851575      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.08% )
                 4      context-switches          #    0.001 K/sec                    ( +-  5.88% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 3      page-faults               #    0.001 K/sec                  
    20,007,739,240      cycles                    #    4.382 GHz                      ( +-  0.00% )
     1,001,181,788      branches                  #  219.276 M/sec                    ( +-  0.00% )
    12,006,455,028      instructions              #    0.60  insn per cycle           ( +-  0.00% )
    13,009,415,501      uops_issued_any           # 2849.286 M/sec                    ( +-  0.00% )
    12,009,592,328      uops_executed_thread      # 2630.307 M/sec                    ( +-  0.00% )
    13,055,852,774      lsd_uops                  # 2859.456 M/sec                    ( +-  0.29% )

       4.565914158 seconds time elapsed                                          ( +-  0.08% )

Unexplained: it runs from the LSD, even though it makes AH dirty. (At least I think it does. TODO: try adding some instructions that do something with eax before the mov eax,ebx clears it.)

But with mov ah, bl, it runs in 5.0c per iteration (imul throughput bottleneck) on both HSW/SKL. (The commented-out store/reload works, too, but SKL has faster store-forwarding than HSW, and it's variable-latency...)

 #  mov ah, bl   version
 5,009,785,393      cycles                    #    4.289 GHz                      ( +-  0.08% )
 1,000,315,930      branches                  #  856.373 M/sec                    ( +-  0.00% )
11,001,728,338      instructions              #    2.20  insn per cycle           ( +-  0.00% )
12,003,003,708      uops_issued_any           # 10275.807 M/sec                   ( +-  0.00% )
11,002,974,066      uops_executed_thread      # 9419.678 M/sec                    ( +-  0.00% )
         1,806      lsd_uops                  #    0.002 M/sec                    ( +-  3.88% )

   1.168238322 seconds time elapsed                                          ( +-  0.33% )

Notice that it doesn't run from the LSD anymore.
