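Problem description
I'm checking how the compiler emits instructions for multi-core memory barriers on x86_64. The code below is what I'm testing, using gcc_x86_64_8.3.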
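#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> flag {false};
int any_value {0};

// Producer: write the payload, then publish it with a release store.
void set()
{
    any_value = 10;
    flag.store(true, std::memory_order_release);
}

// Consumer: spin until the flag is observed, then check the payload.
void get()
{
    while (!flag.load(std::memory_order_acquire));
    assert(any_value == 10);
}

int main()
{
    std::thread a {set};
    get();
    a.join();
}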
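When I use std::memory_order_seq_cst, I can see that the MFENCE instruction is emitted at every optimization level (-O1, -O2, -O3). This instruction makes sure the store buffer is flushed, so the buffered data is written to the L1D cache (and the MESI protocol makes sure other threads can see the effect).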
However, when I use std::memory_order_release/acquire, the MFENCE instruction is also emitted with no optimizations, but it is omitted with -O1, -O2, and -O3, and I don't see any other instruction that flushes the buffers.

When MFENCE is not used, what makes sure the store buffer data is committed to cache memory to ensure the memory order semantics?

Below is the assembly code for the get/set functions with -O3, as shown on the Godbolt compiler explorer:

set():
        mov     DWORD PTR any_value[rip], 10
        mov     BYTE PTR flag[rip], 1
        ret
.LC0:
        .string "/tmp/compiler-explorer-compiler119218-62-hw8j86.n2ft/example.cpp"
.LC1:
        .string "any_value == 10"
get():
.L8:
        movzx   eax, BYTE PTR flag[rip]
        test    al, al
        je      .L8
        cmp     DWORD PTR any_value[rip], 10
        jne     .L15
        ret
.L15:
        push    rax
        mov     ecx, OFFSET FLAT:get()::__PRETTY_FUNCTION__
        mov     edx, 17
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, OFFSET FLAT:.LC1
        call    __assert_fail

The x86 memory ordering model provides #StoreStore and #LoadStore barriers for all store instructions (1), which is all that release semantics require. The processor also commits a store instruction as soon as possible: that is, once the store instruction has retired, the store has become the oldest in the store buffer, the core holds the target cache line in a writeable coherence state, and a cache port is available to perform the store operation (2). So there is no need for an MFENCE instruction. The flag will become visible to the other thread as soon as possible, and when it does, any_value is guaranteed to be 10.

On the other hand, sequential consistency also requires #StoreLoad and #LoadLoad barriers. MFENCE is required to provide both barriers (3), and so it is used at all optimization levels.

Related: Size of store buffers on Intel hardware? What exactly is a store buffer?

Footnotes:
(1) There are exceptions that don't apply here. In particular, non-temporal stores and stores to the uncacheable write-combining memory types provide only the #LoadStore barrier. In any case, these barriers are provided for stores to the write-back memory type on both Intel and AMD processors.
(2) This is in contrast to write-combining stores, which are made globally visible under certain conditions. See Section 11.3.1 of the Intel manual, Volume 3.
(3) See the discussion under Peter's answer.
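To make the release/acquire argument concrete, here is a minimal sketch of the same message-passing pattern wrapped in a function that can be called and checked. The names run_message_passing, ready, payload, and observed are hypothetical and not from the original post; the only guarantee relied on is the C++ rule that a release store synchronizes with the acquire load that reads it.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper (not from the original post): runs one round of the
// producer/consumer handshake and returns the payload the consumer saw.
int run_message_passing()
{
    std::atomic<bool> ready{false};
    int payload = 0;    // plain, non-atomic data published through `ready`
    int observed = -1;  // what the consumer read after seeing the flag

    std::thread producer([&] {
        payload = 10;                                  // write the data first
        ready.store(true, std::memory_order_release);  // then publish it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            // spin until the release store becomes visible
        }
        observed = payload;  // ordered after the acquire load
    });

    producer.join();
    consumer.join();
    return observed;
}
```

Calling run_message_passing() should always return 10. On x86-64, an optimizing compiler is expected to lower both the release store and the acquire load to plain mov instructions, which matches the -O3 listing in the question.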
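The #StoreLoad requirement mentioned for sequential consistency can be illustrated with the classic store-buffer litmus test. This is a sketch under stated assumptions: count_both_zero, x, y, r1, and r2 are hypothetical names, not part of the original post. Each thread stores to one variable and then loads the other. With std::memory_order_seq_cst, the outcome where both loads return 0 is forbidden by the C++ memory model.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper (not from the original post): runs the store-buffer
// litmus test `iterations` times and counts how often both threads read 0.
int count_both_zero(int iterations)
{
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);   // store to x ...
            r1 = y.load(std::memory_order_seq_cst);  // ... then load y
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);   // store to y ...
            r2 = x.load(std::memory_order_seq_cst);  // ... then load x
        });
        t1.join();
        t2.join();

        // Under seq_cst, at least one thread must observe the other's store.
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With seq_cst the count stays at 0. If the stores were weakened to release and the loads to acquire, the language would permit both loads to return 0, and the x86 store buffer can actually produce that result because a later load may complete before an earlier store to a different location has left the buffer.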