Question
I have the following code:
#include <iostream>
#include <chrono>
#define ITERATIONS "10000"
int main()
{
/*
======================================
The first case: the MOV is outside the loop.
======================================
*/
auto t1 = std::chrono::high_resolution_clock::now();
asm("mov $100, %eax\n"
"mov $200, %ebx\n"
"mov $" ITERATIONS ", %ecx\n"
"lp_test_time1:\n"
" add %eax, %ebx\n" // 1
" add %eax, %ebx\n" // 2
" add %eax, %ebx\n" // 3
" add %eax, %ebx\n" // 4
" add %eax, %ebx\n" // 5
"loop lp_test_time1\n");
auto t2 = std::chrono::high_resolution_clock::now();
auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << time;
/*
======================================
The second case: the MOV is inside the loop (faster).
======================================
*/
t1 = std::chrono::high_resolution_clock::now();
asm("mov $100, %eax\n"
"mov $" ITERATIONS ", %ecx\n"
"lp_test_time2:\n"
" mov $200, %ebx\n"
" add %eax, %ebx\n" // 1
" add %eax, %ebx\n" // 2
" add %eax, %ebx\n" // 3
" add %eax, %ebx\n" // 4
" add %eax, %ebx\n" // 5
"loop lp_test_time2\n");
t2 = std::chrono::high_resolution_clock::now();
time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << '\n' << time << '\n';
}
I compile it with:
gcc version 9.2.0 (GCC)
Target: x86_64-pc-linux-gnu
gcc -Wall -Wextra -pedantic -O0 -o proc proc.cpp
Its output is:
14474
5837
I also compiled it with Clang, with the same result.
So, why is the second case faster (almost a 3x speedup)? Is it actually related to some microarchitectural detail? If it matters, I have an AMD CPU: "AMD A9-9410 RADEON R5, 5 COMPUTE CORES 2C+3G".
Answer
The mov $200, %ebx inside the loop breaks the loop-carried dependency chain through ebx, allowing out-of-order execution to overlap the chain of 5 add instructions across multiple iterations.
Without it, the chain of add instructions bottlenecks the loop on the latency of the add (1 cycle) critical path, instead of on throughput (4 adds/cycle on Excavator, improved from 2/cycle on Steamroller). Your CPU is an Excavator core.
AMD since Bulldozer has an efficient loop instruction (only 1 uop), unlike Intel CPUs, where loop would bottleneck either loop at 1 iteration per 7 cycles. (See https://agner.org/optimize/ for instruction tables, the microarch guide, and more details on everything in this answer.)
With loop and mov taking slots in the front-end (and back-end execution units) away from add, a 3x instead of 4x speedup looks about right.
See this answer for an intro to how CPUs find and exploit Instruction-Level Parallelism (ILP).
See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for some in-depth details about overlapping independent dep chains.
BTW, 10k iterations is not many. Your CPU might not even ramp up out of idle speed in that time, or it might jump to max speed for most of the 2nd loop but none of the first. So be careful with microbenchmarks like this.
Also, your inline asm is unsafe because you forgot to declare clobbers on EAX, EBX, and ECX: you step on the compiler's registers without telling it. Normally you should always compile with optimization enabled, but your code would probably break if you did.