This article looks at x86 MFENCE and C++ memory barriers. It should be a useful reference for anyone working through the same question; let's dig into it below.

Problem Description

I'm checking how the compiler emits instructions for multi-core memory barriers on x86_64. The code below is what I'm testing with gcc_x86_64_8.3.

std::atomic<bool> flag {false};
int any_value {0};

void set()
{
  any_value = 10;
  flag.store(true, std::memory_order_release);
}

void get()
{
  while (!flag.load(std::memory_order_acquire));
  assert(any_value == 10);
}

int main()
{
  std::thread a {set};
  get();
  a.join();
}

When I use std::memory_order_seq_cst, I can see the MFENCE instruction is used at any optimization level -O1, -O2, -O3. This instruction makes sure the store buffers are flushed, thereby updating their data in the L1D cache (and relying on the MESI protocol to make sure other threads can see the effect).
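
For reference, here is a minimal sketch of the seq_cst variant described above (the names flag_sc, any_value_sc and set_seq_cst are my own); with gcc 8.3 on x86-64 the stores stay plain mov instructions and an MFENCE is emitted after the flag store:

#include <atomic>

std::atomic<bool> flag_sc {false};
int any_value_sc {0};

void set_seq_cst()
{
  any_value_sc = 10;
  // gcc 8.3 -O3 emits roughly:
  //   mov     DWORD PTR any_value_sc[rip], 10
  //   mov     BYTE PTR flag_sc[rip], 1
  //   mfence
  flag_sc.store(true, std::memory_order_seq_cst);
}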

However, when I use std::memory_order_release/acquire with no optimizations, the MFENCE instruction is also used, but it is omitted with -O1, -O2, -O3 optimizations, and I don't see any other instructions that flush the buffers.

In the case where MFENCE is not used, what makes sure the store buffer data is committed to cache memory to ensure the memory order semantics?

Below is the assembly code for the get/set functions with -O3, like what we get on the Godbolt compiler explorer:

set():
        mov     DWORD PTR any_value[rip], 10
        mov     BYTE PTR flag[rip], 1
        ret


.LC0:
        .string "/tmp/compiler-explorer-compiler119218-62-hw8j86.n2ft/example.cpp"
.LC1:
        .string "any_value == 10"

get():
.L8:
        movzx   eax, BYTE PTR flag[rip]
        test    al, al
        je      .L8
        cmp     DWORD PTR any_value[rip], 10
        jne     .L15
        ret
.L15:
        push    rax
        mov     ecx, OFFSET FLAT:get()::__PRETTY_FUNCTION__
        mov     edx, 17
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, OFFSET FLAT:.LC1
        call    __assert_fail
Solution

The x86 memory ordering model provides #StoreStore and #LoadStore barriers for all store instructions (1), which is all that release semantics require. Also, the processor will commit a store as soon as possible: once the store instruction has retired, the store is the oldest in the store buffer, the core has the target cache line in a writeable coherence state, and a cache port is available to perform the store operation (2). So there is no need for an MFENCE instruction. The flag will become visible to the other thread as soon as possible, and when it does, any_value is guaranteed to be 10.
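
As a side note (this sketch is mine, not part of the original answer), the same guarantee can be expressed with relaxed accesses plus standalone fences; on x86-64 the release and acquire fences likewise compile to no fence instruction at all, they only stop the compiler from reordering:

#include <atomic>
#include <cassert>

std::atomic<bool> flag2 {false};
int any_value2 {0};

void set_with_fence()
{
  any_value2 = 10;
  std::atomic_thread_fence(std::memory_order_release); // no x86 instruction, compiler barrier only
  flag2.store(true, std::memory_order_relaxed);        // plain mov
}

void get_with_fence()
{
  while (!flag2.load(std::memory_order_relaxed));      // plain movzx in a loop
  std::atomic_thread_fence(std::memory_order_acquire); // no x86 instruction
  assert(any_value2 == 10);
}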

On the other hand, sequential consistency also requires #StoreLoad and #LoadLoad barriers. MFENCE is required to provide both barriers (3), and so it is used at all optimization levels.
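
To make the #StoreLoad point concrete, here is a sketch of the classic store-buffer litmus test (all names are mine): with release stores and acquire loads each core's load may be satisfied while its own store is still sitting in the store buffer, so r1 == 0 && r2 == 0 is an allowed outcome; making every access memory_order_seq_cst, which is what forces the MFENCE, rules that outcome out.

#include <atomic>
#include <thread>

std::atomic<int> x {0}, y {0};
int r1, r2;

void t1()
{
  x.store(1, std::memory_order_release);   // plain mov on x86
  r1 = y.load(std::memory_order_acquire);  // may execute before the store above is globally visible
}

void t2()
{
  y.store(1, std::memory_order_release);
  r2 = x.load(std::memory_order_acquire);
}

int main()
{
  std::thread a {t1}, b {t2};
  a.join();
  b.join();
  // r1 == 0 && r2 == 0 is possible here; with memory_order_seq_cst on the
  // stores and loads it would be forbidden.
}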

Related: Size of store buffers on Intel hardware? What exactly is a store buffer?


Footnotes:

(1) There are exceptions that don't apply here. In particular, non-temporal stores and stores to the uncacheable write-combining memory types provide only the #LoadStore barrier (see the sketch after these footnotes). Anyway, these barriers are provided for stores to the write-back memory type on both Intel and AMD processors.

(2) This is in contrast to write-combining stores which are made globally-visible under certain conditions. See Section 11.3.1 of the Intel manual Volume 3.

(3) See the discussion under Peter's answer.
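
To illustrate footnote (1), here is a hedged sketch (mine, not from the answer) of the non-temporal case: a MOVNTI store is not ordered before later ordinary stores by the x86 model, so an explicit SFENCE is needed before the release store to restore the ordering the footnote describes.

#include <atomic>
#include <immintrin.h>

std::atomic<bool> flag_nt {false};
int any_value_nt {0};

void set_nt()
{
  // Non-temporal store (MOVNTI): weakly ordered, so the implicit
  // #StoreStore guarantee does not cover it.
  _mm_stream_si32(&any_value_nt, 10);

  // SFENCE orders the non-temporal store before the flag store below.
  _mm_sfence();

  flag_nt.store(true, std::memory_order_release);  // plain mov
}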

That concludes this article on x86 MFENCE and C++ memory barriers. We hope the answer above is helpful, and thanks for your support!
