Why do GCC and Clang generate such different asm for this code (x86_64, -O3 -std=c++17)?
#include <atomic>
int global_var = 0;
int foo_seq_cst(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_seq_cst);
    return ia.load(std::memory_order_seq_cst);
}
int foo_relaxed(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_relaxed);
    return ia.load(std::memory_order_relaxed);
}
GCC 9.1:
foo_seq_cst(int):
add edi, DWORD PTR global_var[rip]
mov DWORD PTR [rsp-4], edi
mfence
mov eax, DWORD PTR [rsp-4]
ret
foo_relaxed(int):
add edi, DWORD PTR global_var[rip]
mov DWORD PTR [rsp-4], edi
mov eax, DWORD PTR [rsp-4]
ret
Clang 8.0:
foo_seq_cst(int): # @foo_seq_cst(int)
mov eax, edi
add eax, dword ptr [rip + global_var]
ret
foo_relaxed(int): # @foo_relaxed(int)
mov eax, edi
add eax, dword ptr [rip + global_var]
ret
I suspect the mfence here is overkill, am I right? Or can the code Clang generates lead to bugs in some cases?
Best answer
A more realistic example:
#include <atomic>
std::atomic<int> a;
void foo_seq_cst(int b) {
    a = b;
}
void foo_relaxed(int b) {
    a.store(b, std::memory_order_relaxed);
}
gcc-9.1:
foo_seq_cst(int):
mov DWORD PTR a[rip], edi
mfence
ret
foo_relaxed(int):
mov DWORD PTR a[rip], edi
ret
clang-8.0:
foo_seq_cst(int): # @foo_seq_cst(int)
xchg dword ptr [rip + a], edi
ret
foo_relaxed(int): # @foo_relaxed(int)
mov dword ptr [rip + a], edi
ret
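gcc uses mfence, whereas clang uses xchg, to implement a std::memory_order_seq_cst store. An xchg with a memory operand implies a lock prefix. Both lock and mfence satisfy the requirements of std::memory_order_seq_cst, namely no reordering and a single total order, as described in the Intel 64 and IA-32 Architectures Software Developer's Manual.
In benchmarks, lock was measured to be 2-3x faster than mfence, and Linux switched from mfence to lock where possible.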
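To see why a seq_cst store needs more than a plain mov on x86, a store-buffering litmus test may help. The sketch below is illustrative only: count_both_zero and its parameters are hypothetical names, not part of the original question or answer. Two threads each store to one variable and then load the other. With std::memory_order_seq_cst on every operation, the C++ memory model forbids both loads from returning 0. With std::memory_order_relaxed that outcome is allowed, and on x86 it can occur because a store may still be sitting in the store buffer when the following load executes.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-buffering litmus test `iterations`
// times with the given memory order on all stores and loads, and returns
// how many runs ended with both threads reading 0.
int count_both_zero(std::memory_order order, int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        // Each thread stores to its own variable, then loads the other one.
        std::thread t1([&] { x.store(1, order); r1 = y.load(order); });
        std::thread t2([&] { y.store(1, order); r2 = x.load(order); });
        t1.join();
        t2.join();

        // join() synchronizes with thread completion, so both results are set.
        assert(r1 != -1 && r2 != -1);
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

count_both_zero(std::memory_order_seq_cst, n) must return 0 for any n. The relaxed variant may return a non-zero count, although starting a fresh pair of threads per run makes the reordering hard to observe in practice. The mfence emitted by gcc and the xchg emitted by clang are two ways of making the store visible before the later load executes.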