Does a memory barrier ensure that the cache coherence has been completed?

Question


Say I have two threads that manipulate the global variable x. Each thread (or each core I suppose) will have a cached copy of x.

Now say that Thread A executes the following instructions:

set x to 5
some other instruction

Now when set x to 5 is executed, the cached value of x will be set to 5; this will cause the cache coherence protocol to act and update the caches of the other cores with the new value of x.

Now my question is: when x is actually set to 5 in Thread A's cache, do the caches of the other cores get updated before some other instruction is executed? Or should a memory barrier be used to ensure that?:

set x to 5
memory barrier
some other instruction

Note: Assume that the instructions are executed in order; also assume that when set x to 5 is executed, 5 is immediately placed in Thread A's cache (so the instruction was not placed in a queue or something to be executed later).
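
In C++ terms, the question's pseudo-instructions could be written roughly as follows (a minimal sketch; the use of std::atomic and the ordering arguments are assumptions made for illustration, not part of the original question):

    #include <atomic>

    std::atomic<int> x{0};  // the global variable from the question

    void thread_a() {
        x.store(5, std::memory_order_relaxed);                // "set x to 5"
        std::atomic_thread_fence(std::memory_order_seq_cst);  // "memory barrier"
        // some other instruction
    }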

Solution

The memory barriers present on the x86 architecture - but this is true in general - not only guarantee that all the previous loads, or stores, are completed[1] before any subsequent load or store is executed - they also guarantee that the stores have become globally visible.

By globally visible it is meant that other cache-aware agents - like other CPUs - can see the store.
Other agents not aware of the caches - like a DMA-capable device - will not usually see the store if the target memory has been marked with a cache type that doesn't enforce an immediate write into memory.
This has nothing to do with the barrier itself; it is a simple fact of the x86 architecture: caches are visible to the programmer, and when dealing with hardware they are usually disabled.

Intel is purposely generic in its description of the barriers because it doesn't want to tie itself to a specific implementation.
You need to think in the abstract: globally visible implies that the hardware will take all the necessary steps to make the store globally visible. Period.

To understand the barriers, however, it is worth taking a look at the current implementations.
Note that Intel is free to turn the modern implementation upside down at will, as long as it keeps the visible behaviour correct.

A store in an x86 CPU is executed in the core, then placed in the store buffer.
For example mov DWORD [eax+ebx*2+4], ecx, once decoded, is stalled until eax, ebx and ecx are ready[2]; then it is dispatched to an execution unit capable of computing its address.
When the execution is done the store has become a pair (address, value) that is moved into the store buffer.
The store is said to be completed locally (in the core).

The store buffer allows the OoO part of the CPU to forget about the store and consider it completed even if an attempt to write it has not even been made yet.
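
The store buffer is architecturally observable: it is the source of the only reordering x86 permits, StoreLoad. A classic litmus test - a sketch with illustrative names, not taken from the answer - shows that both threads can read 0, because each load may execute while the other thread's store is still sitting in its store buffer:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() {
        x.store(1, std::memory_order_relaxed);   // enters t1's store buffer
        r1 = y.load(std::memory_order_relaxed);  // may execute before that store drains
    }

    void t2() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        // r1 == 0 && r2 == 0 is a possible outcome: neither store was
        // globally visible when the other thread's load executed.
    }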

Upon specific events, like a serialization event, an exception, the execution of a barrier or the exhaustion of the buffer, the CPU flushes the store buffer.
The flush is always in order - First In, First written.

From the store buffer the store enters the realm of the cache.
It can be coalesced into yet another buffer, called the Write Combining buffer, and later written into memory bypassing the caches, if the target address is marked with a WC cache type; if the cache type is WB or WT, it can be written into the L1D cache, or into the L2, the L3 or the LLC if it is not one of the previous.
It can also be written directly to memory if the cache type is UC or WT.


As of today, that's what it means to become globally visible: to leave the store buffer.
Beware of two very important things:

  1. The cache type still influences the visibility.
    Globally visible doesn't mean visible in memory; it means visible where loads from other cores will see it.
    If the memory region is WB cacheable, the store could end up in the cache, so it is globally visible there - but only for agents aware of the existence of the cache. (But note that most DMA on modern x86 is cache-coherent.)
  2. This also applies to the WC buffer, which is non-coherent.
    The WC buffer is not kept coherent - its purpose is to coalesce stores to memory areas where the order doesn't matter, like a framebuffer. Such stores are not really globally visible yet; only after the write-combining buffer is flushed can anything outside the core see them (see the sketch after this list).
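
Non-temporal stores are the usual way ordinary code reaches this write-combining path. The sketch below is a hedged illustration - the function and buffer are hypothetical, while _mm_stream_si32 and _mm_sfence are the actual SSE2 intrinsics - that fills a region through WC buffers and then drains them so agents outside the core can see the data:

    #include <emmintrin.h>  // _mm_stream_si32, _mm_sfence (SSE2)

    void fill_buffer(int* fb, int n, int value) {
        for (int i = 0; i < n; ++i)
            _mm_stream_si32(&fb[i], value);  // non-temporal store: coalesced in a
                                             // WC buffer, bypassing the caches
        _mm_sfence();  // drain/order the weakly-ordered stores before anything later
    }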

sfence does exactly that: it waits for all the previous stores to complete locally and then drains the store buffer.
Since each store in the store buffer can potentially miss, you see how heavy such an instruction is. (But out-of-order execution, including later loads, can continue. Only mfence would block later loads from becoming globally visible (reading from the L1d cache) until after the store buffer finishes committing to cache.)
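
This is also why, on x86, compilers typically implement a sequentially-consistent atomic store as either mov followed by mfence, or as a single xchg, which is implicitly locked and acts as a full barrier; a plain sfence is only needed around weakly-ordered stores such as the non-temporal ones sketched above. A minimal sketch (the variable name is illustrative):

    #include <atomic>

    std::atomic<int> flag{0};

    void publish() {
        flag.store(1, std::memory_order_seq_cst);
        // Typical x86-64 code generation (compiler-dependent):
        //   mov dword ptr [flag], 1
        //   mfence
        // or, on many recent compilers:
        //   mov eax, 1
        //   xchg dword ptr [flag], eax   ; full barrier, drains the store buffer
    }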

But does sfence wait for the stores to propagate into other caches?
Well, no.
Because there is no propagation - let's see what a write into the cache implies from a high-level perspective.

The cache is kept coherent among all the processors with the MESI protocol (MESIF for multi-socket Intel systems, MOESI for AMD ones).
We will only see MESI.

Suppose the write indexes the cache line L, and suppose all the processors have this line L in their caches with the same value.
The state of this line is Shared, in every CPU.

When our store lands in the cache, L is marked as Modified and a special transaction is made on the internal bus (or QPI for multi-socket Intel systems) to invalidate line L in the other processors.

If L was not initially in the S state, the protocol changes accordingly (e.g. if L is in state Exclusive, no transactions on the bus are done).

At this point the write is complete and sfence completes.

This is enough to keep the cache coherent.
When another CPU requests line L, our CPU snoops that request and L is flushed to memory or onto the internal bus, so the other CPU will read the updated version.
The state of L is set to S again.

So basically L is read on demand - this makes sense since propagating the write to other CPUs is expensive, and some architectures do it by writing L back into memory (this works because the other CPU has L in state Invalid, so it must read it from memory).
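
To make those transitions concrete, here is a deliberately tiny toy model of how our copy of line L changes state (an illustration only, not how a real coherence controller is written):

    enum class Mesi { Modified, Exclusive, Shared, Invalid };

    // Our copy of line L when OUR core writes to it.
    Mesi on_local_write(Mesi s) {
        switch (s) {
            case Mesi::Shared:    // broadcast an invalidate to the other cores first
            case Mesi::Exclusive: // silent upgrade: no bus transaction needed
            case Mesi::Modified:  // already ours and dirty
            case Mesi::Invalid:   // read-for-ownership first, then modify
                return Mesi::Modified;
        }
        return Mesi::Invalid;     // unreachable
    }

    // Our copy of line L when ANOTHER core asks to read it.
    Mesi on_remote_read(Mesi s) {
        if (s == Mesi::Modified || s == Mesi::Exclusive)
            return Mesi::Shared;  // for Modified we also flush the dirty data
        return s;                 // Shared stays Shared, Invalid stays Invalid
    }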


Finally, it is not true that sfence et al. are normally useless; on the contrary, they are extremely useful.
It is just that normally we don't care how other CPUs see us making our stores - but acquiring a lock without acquire semantics as defined, for example, in C++, and implemented with fences, is totally nuts.
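
For instance, a minimal spinlock sketch (the class name is illustrative) shows the pairing the answer refers to - the acquire on lock and the release on unlock are exactly the semantics you must not drop:

    #include <atomic>

    class Spinlock {
        std::atomic_flag f = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // acquire: later accesses cannot be reordered before taking the lock
            while (f.test_and_set(std::memory_order_acquire)) { /* spin */ }
        }
        void unlock() {
            // release: earlier accesses cannot be reordered after releasing the lock
            f.clear(std::memory_order_release);
        }
    };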

You should think of the barriers as Intel says: they enforce the order of global visibility of memory accesses.
You can help yourself understand this by thinking of the barriers as enforcing the order of writing into the cache. Cache coherence will then take care of assuring that a write to a cache is globally visible.

I can't help but stress one more time that cache coherency, global visibility and memory ordering are three different concepts.
The first guarantees the second, which is enforced by the third.

Memory ordering -- enforces --> Global visibility -- needs -> Cache coherency
'.______________________________'_____________.'                            '
                 Architectural  '                                           '
                                 '._______________________________________.'
                                             micro-architectural


Footnotes:

  1. In program order.
  2. That was a simplification. On Intel CPUs, mov [eax+ebx*2+4], ecx decodes into two separate uops: store-address and store-data. The store-address uop has to wait until eax and ebx are ready, then it is dispatched to an execution unit capable of computing its address. That execution unit writes the address into the store buffer, so later loads (in program order) can check for store-forwarding.

    When ecx is ready, the store-data uop can dispatch to the store-data port, and write the data into the same store buffer entry.

    This can happen before or after the address is known, because the store-buffer entry is probably reserved in program order, so the store buffer (aka memory order buffer) can keep track of load / store ordering once the address of everything is eventually known, and check for overlaps. (And check for speculative loads that ended up violating x86's memory ordering rules: if another core invalidated the cache line they loaded from before the earliest point they were architecturally allowed to load, that leads to a memory-order mis-speculation pipeline clear.)
