Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

Problem description

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement an identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with:

nasm -felf64 p14.asm && gcc p14.o -o p14

The C++ was compiled with:

g++ p14.cpp -o p14

Assembly, p14.asm:

section .data
    fmt db "%d", 10, 0

global main
extern printf

section .text

main:
    mov rcx, 1000000
    xor rdi, rdi        ; max i
    xor rsi, rsi        ; i

l1:
    dec rcx
    xor r10, r10        ; count
    mov rax, rcx

l2:
    test rax, 1
    jpe even

    mov rbx, 3
    mul rbx
    inc rax
    jmp c1

even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

c1:
    inc r10
    cmp rax, 1
    jne l2

    cmp rdi, r10
    cmovl rdi, r10
    cmovl rsi, rcx

    cmp rcx, 2
    jne l1

    mov rdi, fmt
    xor rax, rax
    call printf
    ret

C++, p14.cpp:

#include <iostream>

int sequence(long n) {
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = 3*n + 1;
        ++count;
    }
    return count;
}

int main() {
    int max = 0, maxi;
    for (int i = 999999; i > 0; --i) {
        int s = sequence(i);
        if (s > max) {
            max = s;
            maxi = i;
        }
    }
    std::cout << maxi << std::endl;
}

I know about the compiler optimizations to improve speed and everything, but I don’t see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).

The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.

But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.

Execution times

My system: 64-bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).

  • g++ (unoptimized): 1272 ms average.
  • g++ -O3: 578 ms average.
  • original asm (div): 2650 ms average.
  • asm (shr): 679 ms average.
  • @johnfound's asm (assembled with NASM): 501 ms average.
  • @hidefromkgb's asm: 200 ms average.
  • @hidefromkgb's asm, improved by @PeterCordes: 145 ms average.
  • @Veedrac's C++: 81 ms average with -O3.

Solution

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).

See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.

See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).

Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid" is in a similar vein.


even:
    mov rbx, 2
    xor rdx, rdx
    div rbx

On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early). High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.

shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.

For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.


As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all)

GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).
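For a concrete illustration of that paragraph (hypothetical function names, not from the post), two trivial functions are enough to see it on Godbolt: compilers turn the second one into a multiply by a fixed-point "magic" constant plus shifts, with no DIV/IDIV anywhere.

    // Minimal sketch: division by a compile-time constant never needs IDIV.
    #include <cstdint>

    uint64_t div_by_2(uint64_t x)  { return x / 2;  }   // compiles to a single SHR
    uint64_t div_by_13(uint64_t x) { return x / 13; }   // compiles to a multiplicative-inverse
                                                         // MUL plus shifts, no DIV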

gcc -Os (optimize for size) does use IDIV for non-power-of-2 division, unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.


Helping the compiler

(summary for this case: use uint64_t n)

First of all, it's only interesting to look at optimized compiler output (-O3); -O0 speed is basically meaningless.

Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: Writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. (Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.)

This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.

My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.

Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory which adds overhead.)

In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs, see the SAR insn set ref manual entry). (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed-overflow is undefined behaviour, so it should have been able to.)

You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).
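A minimal sketch of that suggested source change (the question's sequence() with an unsigned 64-bit n, so the compiler can use a plain SHR with no signed-rounding fixup):

    #include <cstdint>

    int sequence(uint64_t n) {
        int count = 1;
        while (n != 1) {
            if (n % 2 == 0)      // n & 1 == 0: compiles to TEST + SHR, no division
                n /= 2;
            else
                n = 3*n + 1;
            ++count;
        }
        return count;
    }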


BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:

 # from gcc5.4 -O3  plus my comments

 # edx= count=1
 # rax= uint64_t n

.L9:                   # do{
    lea    rcx, [rax+1+rax*2]   # rcx = 3*n + 1
    mov    rdi, rax
    shr    rdi         # rdi = n>>1;
    test   al, 1       # set flags based on n%2 (aka n&1)
    mov    rax, rcx
    cmove  rax, rdi    # n= (n%2) ? 3*n+1 : n/2;
    add    edx, 1      # ++count;
    cmp    rax, 1
    jne   .L9          #}while(n!=1)

  cmp/branch to update max and maxi, and then do the next n

The inner loop is branchless, and the critical path of the loop-carried dependency chain is:

  • 3-component LEA (3 cycles)
  • cmov (2 cycles on Haswell, 1c on Broadwell or later).

Total: 5 cycles per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).

The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.

Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.

The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.


Beating the compiler

GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions.

It could also save all the MOV instructions, and the TEST: SHR sets CF= the bit shifted out, so we can use cmovc instead of test / cmovz.

 ### Hand-optimized version of what gcc does
.L9:                       #do{
    lea     rcx, [rax+1+rax*2] # rcx = 3*n + 1
    shr     rax, 1         # n>>=1;    CF = n&1 = n%2
    cmovc   rax, rcx       # n= (n&1) ? 3*n+1 : n/2;
    inc     edx            # ++count;
    cmp     rax, 1
    jne     .L9            #}while(n!=1)

See @johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)

Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency. The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.

Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).

LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).

So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.

Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)


Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.
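As a rough sketch of that thread-level split (helper names are mine, not from the answer): each thread scans a strided sub-range and produces its own (max, maxi) pair, merged at the end.

    #include <cstdint>
    #include <thread>
    #include <vector>

    static int sequence(uint64_t n) {              // same logic as the earlier sketch
        int count = 1;
        while (n != 1) { n = (n & 1) ? 3*n + 1 : n >> 1; ++count; }
        return count;
    }

    int main() {
        const int nthreads = 4;                    // assumption: tune for your CPU
        struct Result { int max = 0, maxi = 0; };
        std::vector<Result> results(nthreads);     // one slot per thread
        std::vector<std::thread> pool;
        for (int t = 0; t < nthreads; ++t)
            pool.emplace_back([t, nthreads, &results] {
                for (int i = 999999 - t; i > 0; i -= nthreads) {
                    int s = sequence((uint64_t)i);
                    if (s > results[t].max) { results[t].max = s; results[t].maxi = i; }
                }
            });
        for (auto &th : pool) th.join();

        Result best;                               // merge the per-thread results
        for (const Result &r : results)
            if (r.max > best.max) best = r;        // best.maxi = start with the longest chain
        return 0;
    }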

Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.

The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.
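For illustration, a scalar sketch of the "wait until both are done" option (function name is mine): each chain keeps stepping only while it hasn't hit 1, which is exactly the conditional counter increment mentioned above. In real code you'd want the per-chain update to stay branchless (CMOV), like gcc's inner loop; the ifs here are just for readability.

    #include <cstdint>

    static void collatz_pair(uint64_t a, uint64_t b, int &count_a, int &count_b) {
        count_a = 1;
        count_b = 1;
        while (a != 1 || b != 1) {
            if (a != 1) { a = (a & 1) ? 3*a + 1 : a >> 1; ++count_a; }
            if (b != 1) { b = (b & 1) ? 3*b + 1 : b >> 1; ++count_b; }
        }
    }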


You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth with 256b vector (4x uint64_t).

I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.

Untested idea for manual vectorization

# starting with YMM0 = [ n_d, n_c, n_b, n_a ]  (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1):  increment vector
# ymm5 = all-zeros:  count vector

.inner_loop:
    vpaddq    ymm1, ymm0, ymm0
    vpaddq    ymm1, ymm1, ymm0
    vpaddq    ymm1, ymm1, set1_epi64(1)     # ymm1 = 3*n + 1.  Maybe could do this more efficiently?

    vpsllq    ymm3, ymm0, 63                # shift the low bit (n&1) up to the sign bit

    vpsrlq    ymm0, ymm0, 1                 # n /= 2

    # FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
    vblendvpd ymm0, ymm0, ymm1, ymm3        # variable blend controlled by the sign bit of each 64-bit element.  I might have the source operands backwards, I always have to look this up.

    # ymm0 = updated n  in each element.

    vpcmpeqq ymm1, ymm0, set1_epi64(1)
    vpandn   ymm4, ymm1, ymm4         # zero out elements of ymm4 where the compare was true

    vpaddq   ymm5, ymm5, ymm4         # count++ in elements where n has never been == 1

    vptest   ymm4, ymm4
    jnz  .inner_loop
    # Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero

    vextracti128 xmm0, ymm5, 1
    vpmaxq .... crap this doesn't exist
    # Actually just delay doing a horizontal max until the very very end.  But you need some way to record max and maxi.

You can and should implement this with intrinsics instead of hand-written asm.
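Along those lines, here is a hedged, untested intrinsics sketch of the same idea (function name is mine; it mirrors the asm above, including its counting convention, so adjust the counts to match sequence() if you need exact term counts). It needs AVX2, which the OP's Celeron doesn't have, and assumes all four start values are >= 1.

    #include <immintrin.h>
    #include <cstdint>

    void collatz4(const uint64_t start[4], uint64_t counts[4]) {
        __m256i n   = _mm256_loadu_si256((const __m256i*)start);
        __m256i one = _mm256_set1_epi64x(1);
        __m256i inc = one;                                   // goes to 0 in lanes that have hit 1
        __m256i cnt = _mm256_setzero_si256();

        do {
            __m256i odd  = _mm256_add_epi64(_mm256_add_epi64(n, n),
                                            _mm256_add_epi64(n, one));   // 3*n + 1
            __m256i half = _mm256_srli_epi64(n, 1);                      // n >> 1
            __m256i mask = _mm256_slli_epi64(n, 63);                     // low bit -> sign bit
            // blendv picks the second source where the mask's sign bit is set (odd n)
            n = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_castsi256_pd(half),
                                                     _mm256_castsi256_pd(odd),
                                                     _mm256_castsi256_pd(mask)));
            __m256i done = _mm256_cmpeq_epi64(n, one);       // lanes where n == 1
            inc = _mm256_andnot_si256(done, inc);            // sticky: zero those lanes' increment
            cnt = _mm256_add_epi64(cnt, inc);                // count++ where still running
        } while (!_mm256_testz_si256(inc, inc));             // until the increment vector is all-zero

        _mm256_storeu_si256((__m256i*)counts, cnt);
    }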


Algorithmic / implementation improvement:

Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)
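For instance, a hedged sketch of the memoization idea (names are mine, not from the answer): cache the length for every start below the limit, so a chain stops as soon as it reaches a value whose ending is already known.

    #include <cstdint>
    #include <vector>

    static std::vector<int> cache;              // cache[i] = sequence length starting at i, 0 = unknown

    static int collatz_len(uint64_t n) {
        if (n < cache.size() && cache[n] != 0)
            return cache[n];                    // common ending already computed
        uint64_t next = (n & 1) ? 3*n + 1 : n >> 1;
        int len = 1 + collatz_len(next);
        if (n < cache.size())
            cache[n] = len;                     // only cache values that fit in the table
        return len;
    }

    int longest_start_below(int limit) {        // returns the start i with the longest chain
        cache.assign(limit, 0);
        cache[1] = 1;
        int max = 0, maxi = 1;
        for (int i = limit - 1; i > 0; --i) {
            int s = collatz_len(i);
            if (s > max) { max = s; maxi = i; }
        }
        return maxi;
    }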

@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.

So the loop might look like this:

goto loop_entry;  // C++ structured like the asm, for illustration only
do {
   n = n*3 + 1;
  loop_entry:
   shift = _tzcnt_u64(n);
   n >>= shift;
   count += shift;
} while(n != 1);
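A self-contained version of that illustration might look like this (function name is mine; it assumes BMI1's _tzcnt_u64 is available, e.g. with -mbmi — note the comment at the end of this answer about Celeron/Pentium lacking BMI1):

    #include <immintrin.h>
    #include <cstdint>

    int sequence_tzcnt(uint64_t n) {           // n must start >= 1
        int count = 1;                         // count the starting term, like the OP's sequence()
        uint64_t shift;
        goto loop_entry;                       // same structure as the loop above
        do {
            n = n*3 + 1;
            ++count;                           // the 3n+1 step is one more term
        loop_entry:
            shift = _tzcnt_u64(n);             // n is never 0 here, so BSF would work too
            n >>= shift;
            count += (int)shift;               // each halving is one more term
        } while (n != 1);
        return count;
    }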

This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).

It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. @Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).

We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).

TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.

They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x assumed no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)

Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for @Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dep for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.


Improvements to ideas / code from other answers:

@hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.

I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in @hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)

Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.


@Veedrac's code

I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.

code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.

See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium)
