Problem description
After getting some feedback, I created a new example which should be more reproducible.
I've been writing a project in C++ that involves lots of linked list iteration. To get a benchmark, I rewrote the code in Go. Surprisingly I've found that the Go implementation runs consistently faster by ~10%, even after passing the -O flag to clang++. Probably I'm just missing some obvious optimization in C++ but I've been banging my head against a wall for a while with various tweaks.
Here's a simplified version, with identical implementations in C++ and Go where the Go program runs faster. All it does is create a linked list with 3000 nodes, and then time how long it takes to iterate over this list 1,000,000 times (7.5 secs in C++, 6.8 in Go).
C++:
```cpp
#include <iostream>
#include <chrono>

using namespace std;
using ms = chrono::milliseconds;

struct Node {
    Node *next;
    double age;
};

// Global linked list of nodes
Node *nodes = nullptr;

void iterateAndPlace(double age) {
    Node *node = nodes;
    Node *prev = nullptr;

    while (node != nullptr) {
        // Just to make sure that age field is accessed
        if (node->age > 99999) {
            break;
        }
        prev = node;
        node = node->next;
    }

    // Arbitrary action to make sure the compiler
    // doesn't optimize away this function
    prev->age = age;
}

int main() {
    Node x = {};
    std::cout << "Size of struct: " << sizeof(x) << "\n"; // 16 bytes

    // Fill in global linked list with 3000 dummy nodes
    for (int i = 0; i < 3000; i++) {
        Node *newNode = new Node;
        newNode->age = 0.0;
        newNode->next = nodes;
        nodes = newNode;
    }

    auto start = chrono::steady_clock::now();
    for (int i = 0; i < 1000000; i++) {
        iterateAndPlace(100.1);
    }
    auto end = chrono::steady_clock::now();
    auto diff = end - start;
    std::cout << "Elapsed time is : " << chrono::duration_cast<ms>(diff).count() << " ms " << endl;
}
```
Go:
```go
package main

import (
    "fmt"
    "time"
    "unsafe"
)

type Node struct {
    next *Node
    age  float64
}

var nodes *Node = nil

func iterateAndPlace(age float64) {
    node := nodes
    var prev *Node = nil

    for node != nil {
        if node.age > 99999 {
            break
        }
        prev = node
        node = node.next
    }
    prev.age = age
}

func main() {
    x := Node{}
    fmt.Printf("Size of struct: %d\n", unsafe.Sizeof(x)) // 16 bytes

    for i := 0; i < 3000; i++ {
        newNode := new(Node)
        newNode.next = nodes
        nodes = newNode
    }

    start := time.Now()
    for i := 0; i < 1000000; i++ {
        iterateAndPlace(100.1)
    }
    fmt.Printf("Time elapsed: %s\n", time.Since(start))
}
```
Output on my Mac:
```
$ go run minimal.go
Size of struct: 16
Time elapsed: 6.865176895s

$ clang++ -std=c++11 -stdlib=libc++ minimal.cpp -O3; ./a.out
Size of struct: 16
Elapsed time is : 7524 ms
```
Clang version:
```
$ clang++ --version
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
```
UKMonkey brought up the fact that the nodes may be contiguously allocated in Go but not C++. To test this, I allocated contiguously in C++ with a vector, and this did not change the runtime:
```cpp
// Fill in global linked list with 3000 contiguous dummy nodes
vector<Node> vec;
vec.reserve(3000);
for (int i = 0; i < 3000; i++) {
    vec.push_back(Node());
}

nodes = &vec[0];
Node *curr = &vec[0];
for (int i = 1; i < 3000; i++) {
    curr->next = &vec[i];
    curr = curr->next;
    curr->age = 0.0;
}
```
I checked that the resulting linked list is indeed contiguous:
```cpp
std::cout << &nodes << " " << &nodes->next << " "
          << &nodes->next->next << " " << &nodes->next->next->next << "\n";
```

```
0x1032de0e0 0x7fb934001000 0x7fb934001010 0x7fb934001020
```
Recommended answer
Preface: I am not a C++ expert or assembly expert. But I know a little bit of them, enough to be dangerous, perhaps.
So my curiosity was piqued, and I decided to take a look at the assembler generated for the Go version, then check it against the output for clang++.
High-level summary
Later on here, I go through the assembler output for both languages in x86-64 assembler. The fundamental "critical section" of code in this example is a very tight loop. For that reason, it's the largest contributor to the time spent in the program.
Why tight loops matter is that modern CPUs can usually execute instructions faster than the values the code references (e.g. for comparisons) can be loaded from memory. To achieve the blazing speeds they do, CPUs perform a number of tricks, including pipelining, branch prediction, and more. Tight loops are often the bane of pipelining, and realistically branch prediction can be only marginally helpful if there's a dependency relationship between values.
Fundamentally, the traversal loop has four main chunks:
1. If `node` is null, exit the loop.
2. If `node.age` > 99999, exit the loop.
3a. Set `prev = node`.
3b. Set `node = node.next`.
Each of these is represented by several assembler instructions, but the chunks as output by Go and C++ are ordered differently. The C++ effectively does it in order 3a, 1, 2, 3b. The Go version does it in order 3, 2, 1. (It starts the first loop on segment 2 to avoid the assignment happening before the null checks.)
In actuality, clang++ outputs a couple fewer instructions than Go and should do fewer RAM accesses (at the cost of one more floating-point register). One might imagine that executing almost the same instructions, just in different orders, should take the same amount of time, but that doesn't take into account pipelining and branch prediction.
Takeaways

One might be tempted to hand-optimize this code and write assembly if it were a critical but small loop. Ignoring the obvious reasons (it's riskier/more complex/more prone to bugs), there's also the fact that while the Go-generated code was faster for the two Intel x86-64 processors I tested it with, it's possible that with an AMD processor you'd get the opposite results. It's also possible that with the N+1th-gen Intel you'll get different results.
My full investigation follows:
Investigation
NOTE: I've snipped the examples as short as possible, including truncating filenames and removing excess fluff from the assembly listing, so your outputs may look slightly different from mine. But anyway, I continue.
So I ran go build -gcflags -S main.go to get this assembly listing, and I'm only really looking at iterateAndPlace.
```
"".iterateAndPlace STEXT nosplit size=56 args=0x8 locals=0x0
    00000 (main.go:16)  TEXT    "".iterateAndPlace(SB), NOSPLIT, $0-8
    00000 (main.go:16)  FUNCDATA    $0, gclocals·2a5305abe05176240e61b8620e19a815(SB)
    00000 (main.go:16)  FUNCDATA    $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
    00000 (main.go:17)  MOVQ    "".nodes(SB), AX
    00007 (main.go:17)  MOVL    $0, CX
    00009 (main.go:20)  JMP     20
    00011 (main.go:25)  MOVQ    (AX), DX
    00014 (main.go:25)  MOVQ    AX, CX
    00017 (main.go:25)  MOVQ    DX, AX
    00020 (main.go:20)  TESTQ   AX, AX
    00023 (main.go:20)  JEQ     44
    00025 (main.go:21)  MOVSD   8(AX), X0
    00030 (main.go:21)  MOVSD   $f64.40f869f000000000(SB), X1
    00038 (main.go:21)  UCOMISD X1, X0
    00042 (main.go:21)  JLS     11
    00044 (main.go:21)  MOVSD   "".age+8(SP), X0
    00050 (main.go:28)  MOVSD   X0, 8(CX)
    00055 (main.go:29)  RET
```
In case you lost context, I'll paste the original listing with the line numbers here:
```
16  func iterateAndPlace(age float64) {
17      node := nodes
18      var prev *Node = nil
19
20      for node != nil {
21          if node.age > 99999 {
22              break
23          }
24          prev = node
25          node = node.next
26      }
27
28      prev.age = age
29  }
```
A few interesting things I noticed immediately:
- It's not generating any code for line 24, prev = node. That is because it's realized the assignment can be cheated: in traversing to get node.next it uses the CX register, which holds the value of prev. This is probably a nice optimization the SSA compiler can realize is redundant.
- The if statement checking node.age is reordered to be after the node = node.next stuff, and is skipped on the first iteration. You can think of this as more like a do..while loop in that case. Overall minor, since it only really changes the first iteration. But maybe that's enough?
So let's jump over to the C++ assembly, which you get from clang++ -S -mllvm --x86-asm-syntax=intel -O3 minimal.cpp.
```
    .quad   4681608292164698112     ## double 99999
    # note I snipped some stuff here
__Z15iterateAndPlaced:              ## @_Z15iterateAndPlaced
## BB#0:
    push    rbp
Lcfi0:
    .cfi_def_cfa_offset 16
Lcfi1:
    .cfi_offset rbp, -16
    mov     rbp, rsp
Lcfi2:
    .cfi_def_cfa_register rbp
    mov     rcx, qword ptr [rip + _nodes]
    xor     eax, eax
    movsd   xmm1, qword ptr [rip + LCPI0_0] ## xmm1 = mem[0],zero
    .p2align    4, 0x90
LBB0_2:                             ## =>This Inner Loop Header: Depth=1
    mov     rdx, rax
    mov     rax, rcx
    movsd   xmm2, qword ptr [rax + 8]       ## xmm2 = mem[0],zero
    ucomisd xmm2, xmm1
    ja      LBB0_3
## BB#1:                            ##   in Loop: Header=BB0_2 Depth=1
    mov     rcx, qword ptr [rax]
    test    rcx, rcx
    mov     rdx, rax
    jne     LBB0_2
LBB0_3:
    movsd   qword ptr [rdx + 8], xmm0
    pop     rbp
    ret
```
This is really interesting. The assembly generated is overall fairly similar (ignoring minor differences in how the assemblers list the syntax): it made a similar optimization about not assigning prev. Furthermore, the C++ seems to have eliminated the need to load 99999 every time the comparison is done (the Go version loads it right before the comparison each time).
For replication purposes, versions of things I used (on an x86-64 darwin mac on OSX High Sierra):
```
$ go version
go version go1.9.3 darwin/amd64

$ clang++ --version
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
```