Why do GCC and Clang generate such different asm for this code (x86_64, -O3 -std=c++17)?

#include <atomic>

int global_var = 0;

int foo_seq_cst(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_seq_cst);
    return ia.load(std::memory_order_seq_cst);
}

int foo_relaxed(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_relaxed);
    return ia.load(std::memory_order_relaxed);
}

GCC 9.1:
foo_seq_cst(int):
        add     edi, DWORD PTR global_var[rip]
        mov     DWORD PTR [rsp-4], edi
        mfence
        mov     eax, DWORD PTR [rsp-4]
        ret
foo_relaxed(int):
        add     edi, DWORD PTR global_var[rip]
        mov     DWORD PTR [rsp-4], edi
        mov     eax, DWORD PTR [rsp-4]
        ret

Clang 8.0:
foo_seq_cst(int):                       # @foo_seq_cst(int)
        mov     eax, edi
        add     eax, dword ptr [rip + global_var]
        ret
foo_relaxed(int):                       # @foo_relaxed(int)
        mov     eax, edi
        add     eax, dword ptr [rip + global_var]
        ret

I suspect the mfence here is overkill. Am I right? Or could the code Clang generates lead to errors in some cases?

Best answer

A more realistic example:

#include <atomic>

std::atomic<int> a;

void foo_seq_cst(int b) {
    a = b;
}

void foo_relaxed(int b) {
    a.store(b, std::memory_order_relaxed);
}

gcc-9.1:
foo_seq_cst(int):
        mov     DWORD PTR a[rip], edi
        mfence
        ret
foo_relaxed(int):
        mov     DWORD PTR a[rip], edi
        ret

clang-8.0:
foo_seq_cst(int):                       # @foo_seq_cst(int)
        xchg    dword ptr [rip + a], edi
        ret
foo_relaxed(int):                       # @foo_relaxed(int)
        mov     dword ptr [rip + a], edi
        ret

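For std::memory_order_seq_cst stores, gcc uses mov followed by mfence, whereas clang uses xchg.
xchg with a memory operand implies the lock prefix. Both lock and mfence satisfy the requirements of std::memory_order_seq_cst, namely no reordering and a single total order.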
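
The two code-generation strategies can be sketched at the source level. The following is only an illustration, not what either compiler does internally: the names shared_value, store_then_fence and store_via_exchange are hypothetical, and the two functions are x86 analogues of the listings above rather than exact equivalents of a seq_cst store in the abstract C++ memory model.

```cpp
#include <atomic>

// Hypothetical shared variable, standing in for `a` in the example above.
std::atomic<int> shared_value{0};

// Roughly the shape of GCC's output: a plain store followed by a full fence.
// On x86-64 the fence is typically emitted as mfence (or a locked instruction).
void store_then_fence(int b) {
    shared_value.store(b, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Roughly the shape of Clang's output: an atomic exchange whose old value
// is discarded. On x86-64 this typically compiles to xchg with a memory
// operand, which is implicitly locked.
void store_via_exchange(int b) {
    (void)shared_value.exchange(b, std::memory_order_seq_cst);
}
```

Both functions leave the same value in shared_value. The difference is only in which instruction provides the full barrier after the store.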

From the Intel 64 and IA-32 Architectures Software Developer's Manual:



lock was benchmarked to be 2-3x faster than mfence, and Linux switched from mfence to lock where possible.
