Why does introducing useless MOV instructions speed up a tight loop in x86_64 assembly?
Background:
While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.
To my surprise, removing the unnecessary instruction caused my program to slow down.
I found that adding arbitrary, useless MOV instructions increased performance even further.
The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.
I understand that the CPU does all kinds of optimizations and streamlining, but this seems more like black magic.
Data:
A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20 == 1048576 times. (The surrounding program just calculates SHA-256 hashes.)
The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):
avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without: 1836.44 ms
The programs were run 25 times in a loop, with the run order changing randomly each time.
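The randomized-order methodology above can be sketched as a small harness. This is a hypothetical sketch, not the author's actual benchmark script: `bench` and the callables passed to it are illustrative names, and the real test ran compiled Pascal binaries rather than Python callables.

```python
import random
import time

def bench(variants, reps=25):
    """variants: dict mapping a name to a zero-argument callable.

    Runs every variant `reps` times, reshuffling the run order on each
    pass (as the question describes), and returns the average wall-clock
    time per variant in milliseconds.
    """
    totals = {name: 0.0 for name in variants}
    for _ in range(reps):
        order = list(variants)
        random.shuffle(order)  # new random run order each pass
        for name in order:
            start = time.perf_counter()
            variants[name]()
            totals[name] += (time.perf_counter() - start) * 1000.0
    return {name: total / reps for name, total in totals.items()}
```

Randomizing the order each pass helps cancel out systematic effects such as CPU frequency ramp-up or cache warming favoring whichever variant runs first.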
Excerpt:
{$asmmode intel}
procedure example_junkop_in_sha256;
var s1, t2 : uint32;
begin
// Here are parts of the SHA-256 algorithm, in Pascal:
// s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
// s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
// Here is how I translated them (side by side to show symmetry):
asm
MOV r8d, a ; MOV r9d, e
ROR r8d, 2 ; ROR r9d, 6
MOV r10d, r8d ; MOV r11d, r9d
ROR r8d, 11 {13 total} ; ROR r9d, 5 {11 total}
XOR r10d, r8d ; XOR r11d, r9d
ROR r8d, 9 {22 total} ; ROR r9d, 14 {25 total}
XOR r10d, r8d ; XOR r11d, r9d
// Here is the extraneous operation that I removed, causing a speedup
// s1 is the uint32 variable declared at the start of the Pascal code.
//
// I had cleaned up the code, so I no longer needed this variable, and
// could just leave the value sitting in the r11d register until I needed
// it again later.
//
// Since copying to RAM seemed like a waste, I removed the instruction,
// only to discover that the code ran slower without it.
{$IFDEF JUNKOPS}
MOV s1, r11d
{$ENDIF}
// The next part of the code just moves on to another part of SHA-256,
// maj { r12d } := (a and b) xor (a and c) xor (b and c)
mov r8d, a
mov r9d, b
mov r13d, r9d // Set aside a copy of b
and r9d, r8d
mov r12d, c
and r8d, r12d { a and c }
xor r9d, r8d
and r12d, r13d { c and b }
xor r12d, r9d
// Copying the calculated value to the same s1 variable is another speedup.
// As far as I can tell, it doesn't actually matter what register is copied,
// but moving this line up or down makes a huge difference.
{$IFDEF JUNKOPS}
MOV s1, r9d // after mov r12d, c
{$ENDIF}
// And here is where the two calculated values above are actually used:
// T2 {r12d} := S0 {r10d} + Maj {r12d};
ADD r12d, r10d
MOV T2, r12d
end
end;
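For readers without a Pascal toolchain, the rotate-and-xor pattern the assembly above implements can be modeled in plain Python. This is a sketch for clarity only; `ror`, `big_sigma0`, `big_sigma1`, and `maj` are hypothetical helper names, not identifiers from the original code.

```python
def ror(x: int, n: int) -> int:
    """Rotate a 32-bit value right by n bits (like the x86 ROR instruction)."""
    x &= 0xFFFFFFFF
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def big_sigma0(a: int) -> int:
    # s0 := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
    return ror(a, 2) ^ ror(a, 13) ^ ror(a, 22)

def big_sigma1(e: int) -> int:
    # s1 := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
    return ror(e, 6) ^ ror(e, 11) ^ ror(e, 25)

def maj(a: int, b: int, c: int) -> int:
    # maj := (a and b) xor (a and c) xor (b and c)
    return (a & b) ^ (a & c) ^ (b & c)
```

Note how the assembly avoids three separate full rotations per value by rotating the same register incrementally (2, then 11 more for 13 total, then 9 more for 22 total), XOR-ing a saved copy at each step.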
Try it yourself:

The code is online on GitHub if you want to try it out.
- Why would uselessly copying a register's contents to RAM ever increase performance?
- Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?
- Is this behavior something that could be exploited predictably by a compiler?
Answer
The most likely cause of the speed improvement is that:
- Inserting a MOV shifts the subsequent instructions to different memory addresses
- One of those moved instructions was an important conditional branch
- That branch was incorrectly predicted due to aliasing in the branch prediction table
- Moving the branch eliminated the alias and allowed the branch to be predicted correctly
Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.
This little branch prediction tutorial shows how branch prediction buffers work. The prediction buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important uncorrelated branches share the same lower bits. In that case, you end up with aliasing, which causes many mispredicted branches (which stall the instruction pipeline and slow your program).
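The aliasing effect can be illustrated with a toy model: a small table of 2-bit saturating counters indexed by the low bits of the branch address. This is an illustrative sketch only; the table size, counter scheme, and addresses below are made up and are far simpler than the Core 2's real predictor.

```python
INDEX_BITS = 4  # illustrative 16-entry table: addresses alias modulo 16

def simulate(branches):
    """branches: list of (address, taken) events. Returns mispredict count."""
    table = [1] * (1 << INDEX_BITS)  # 2-bit counters, start weakly not-taken
    mispredicts = 0
    for addr, taken in branches:
        idx = addr & ((1 << INDEX_BITS) - 1)  # low address bits pick the entry
        if (table[idx] >= 2) != taken:        # counter >= 2 predicts "taken"
            mispredicts += 1
        # nudge the saturating counter toward the actual outcome
        table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
    return mispredicts

# Two uncorrelated branches: one always taken, one never taken.
# At 0x100 and 0x110 they share low bits and alias to the same entry;
# shifting the second branch by one byte (0x111) removes the collision.
aliased = simulate([(0x100, True), (0x110, False)] * 1000)
shifted = simulate([(0x100, True), (0x111, False)] * 1000)
```

In this toy model the aliased pair thrashes one shared counter and mispredicts almost every time, while shifting one branch's address by a single byte lets each branch train its own counter. That is the sense in which an inserted MOV, by moving a later branch to a different address, can change prediction accuracy.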
If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: http://stackoverflow.com/a/11227902/1001643
Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.