


Core A writes a value x to its store buffer, waits for the invalidate ack, and then flushes x to the cache. Does it wait for only one ack, or for all of them? And how does it know how many acks to expect across all CPUs?



It isn't clear to me what you mean by "invalid ack", but let's assume you mean a snoop/invalidation originating from another core which is requesting ownership of the same line.


In this case, the stores in the store buffer are generally free to ignore such invalidations from other cores, since stores in the store buffer are not yet globally visible. A store only becomes globally visible when it commits to L1 at some point after it has retired. At this point the cache controller will make an RFO (request for ownership) for the associated line if the line isn't already held exclusively in the cache. It is essentially at this point that the store becomes globally visible. The L1 cache controller doesn't need to know how many other invalidations are in flight, because they are mediated by higher-level components in the system as part of the MESI protocol, and once the controller gets the line in the E state, it is guaranteed to be the exclusive owner.
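The sequence described above can be sketched as a toy model. This is only an illustration under heavy simplifying assumptions: the names `Directory`, `Core`, `rfo`, and `commit_oldest` are hypothetical, a single dictionary stands in for globally visible memory, and real hardware does this in the cache controller and interconnect rather than in software.

```python
class Directory:
    """Hypothetical stand-in for the higher-level coherence agent."""

    def __init__(self):
        self.owner = {}    # line -> core currently holding it exclusively
        self.memory = {}   # line -> last globally visible value

    def rfo(self, requester, line):
        # Invalidate the previous owner's copy (if any), then grant ownership.
        previous = self.owner.get(line)
        if previous is not None and previous is not requester:
            previous.owned.discard(line)
        self.owner[line] = requester
        requester.owned.add(line)


class Core:
    def __init__(self, directory):
        self.directory = directory
        self.store_buffer = []   # FIFO of (line, value); not globally visible
        self.owned = set()       # lines this core holds exclusively

    def store(self, line, value):
        # A store first lands in the store buffer only.
        self.store_buffer.append((line, value))

    def commit_oldest(self):
        # Commit the most senior store: obtain ownership if needed, then write.
        line, value = self.store_buffer.pop(0)
        if line not in self.owned:
            self.directory.rfo(self, line)
        self.directory.memory[line] = value   # the store is now globally visible

    def load(self, line):
        # A core sees its own buffered stores first (store forwarding).
        for buffered_line, value in reversed(self.store_buffer):
            if buffered_line == line:
                return value
        return self.directory.memory.get(line)


directory = Directory()
core_a, core_b = Core(directory), Core(directory)

core_a.store("x", 1)
seen_before = core_b.load("x")        # B cannot see A's buffered store
directory.rfo(core_b, "x")            # B takes ownership of the line
pending = list(core_a.store_buffer)   # A's buffered store is untouched
core_a.commit_oldest()                # A issues its own RFO, then writes
seen_after = core_b.load("x")         # now the store is visible to B
```

In this model the invalidation caused by core B's RFO never touches core A's store buffer; it only changes who owns the line, so core A simply has to request ownership again when its store finally commits.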


In short, invalidations from other cores have little effect on stores in the store buffer, since those stores become globally visible at a single point based on an RFO request. It is loads that have already executed that are more likely to be affected by invalidation activity on another core, especially on strongly ordered platforms such as x86, which doesn't allow visible load-load reordering. The so-called MOB (memory order buffer) on x86, for example, is responsible for tracking whether invalidations potentially break the ordering rules.
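To make the point about loads concrete, here is a deliberately conservative toy model. It assumes a hypothetical `LoadTracker` that remembers which lines were read by loads that have executed but not yet retired, and squashes all of them if any of those lines is invalidated. Real hardware is more selective about when a replay is actually needed; this is a sketch of the idea, not a description of any particular MOB.

```python
class LoadTracker:
    """Hypothetical tracker for loads that executed but have not retired."""

    def __init__(self):
        self.executed = []   # lines read by in-flight loads, in program order

    def execute_load(self, line):
        self.executed.append(line)

    def retire_oldest(self):
        # Once retired, a load can no longer be affected by invalidations.
        return self.executed.pop(0)

    def on_invalidation(self, line):
        # Conservative rule: if another core invalidates a line that an
        # in-flight load already read, discard the speculative results so
        # the loads re-execute and observe a consistent order.
        if line in self.executed:
            self.executed.clear()
            return True    # replay needed
        return False       # unrelated line, nothing to do


tracker = LoadTracker()
tracker.execute_load("y")
tracker.execute_load("x")
unrelated = tracker.on_invalidation("z")   # no in-flight load read "z"
conflict = tracker.on_invalidation("x")    # an in-flight load read "x"
```

Contrast this with the store buffer, which needs no comparable bookkeeping for incoming invalidations.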


Perhaps the "acks" you were talking about are the responses from other cores to the writing core's request to obtain or upgrade its ownership of the line so that it can write to it: i.e., invalidating copies of the line in the other CPUs and so on.


This is commonly known as issuing an RFO, which, when successful, leaves the line in the E state in the requesting core.


Most CPUs are layered, with a variety of different agents working together to ensure coherency. In practice, this means that a core doesn't need to wait for up to N-1 "acks" from the other N-1 cores on an N-core system, but rather for a single reply from a higher-level component, which is itself in charge of sending requests to, and collecting responses from, the other CPUs.


One example could be a single-socket multi-core CPU with private L1 and L2 caches and a shared L3. A core might send its RFO down to the L3, which might send invalidate requests to all cores, wait for their responses, and then acknowledge the RFO to the requesting core. Alternatively, the L3 may store some bits which indicate which cores could possibly have a copy of the line, so that it only needs to send the requests to those cores (the role the L3 takes in that case is sometimes referred to as a snoop filter).
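The difference between broadcasting and filtering can be sketched with a small model. Everything here is an assumption made for illustration: the class `SharedL3`, its `sharers` map, and the `use_filter` flag are hypothetical, and each targeted core is assumed to acknowledge immediately.

```python
class SharedL3:
    """Hypothetical shared L3 that mediates RFOs for all cores."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.sharers = {}            # line -> set of core ids that may hold it
        self.invalidates_sent = 0

    def read(self, core, line):
        # Record that this core may now hold a copy of the line.
        self.sharers.setdefault(line, set()).add(core)

    def rfo(self, core, line, use_filter=True):
        if use_filter:
            # Snoop filter: only cores that may hold the line are contacted.
            targets = self.sharers.get(line, set()) - {core}
        else:
            # Broadcast: every other core is contacted.
            targets = set(range(self.num_cores)) - {core}
        acks = 0
        for _ in targets:
            self.invalidates_sent += 1
            acks += 1                # each target acks immediately in this model
        assert acks == len(targets)  # the L3, not the core, counts the acks
        self.sharers[line] = {core}
        return True                  # the single reply the requesting core sees


filtered = SharedL3(num_cores=8)
filtered.read(1, "x")
filtered.read(5, "x")
filtered.rfo(0, "x")                       # contacts only cores 1 and 5

broadcast = SharedL3(num_cores=8)
broadcast.read(1, "x")
broadcast.read(5, "x")
broadcast.rfo(0, "x", use_filter=False)    # contacts all 7 other cores
```

In both cases the requesting core waits for exactly one reply; how many acks were collected behind that reply is the L3's concern.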


Since all communication between agents passes through the L3, it is able to keep everything consistent. In the case of a multi-socket system, things get more complicated: the L3 on the local socket may again get the request and may pass it over to the other socket to do the same type of invalidation there. Again there might be a snoop filter, or some other mechanism, and the behavior may even be configurable!


For example, in Intel's Broadwell Xeon architecture, there are no fewer than four different configurable snoop modes:


... with different performance tradeoffs:


The rest of that document goes into some detail about how the various modes work.


So I guess the short answer is "it's complicated and depends on the detailed design and possibly even user-configurable settings".


Footnote 1 (on the RFO being made when the store commits to L1): the RFO may potentially be issued at some earlier point, since an optimized implementation might "look ahead" in the store buffer and issue RFOs (so-called "RFO prefetches") for upcoming stores even before they become the most senior store.


Footnote 2 (on invalidations having little effect on stores in the store buffer): invalidations may, however, complicate the RFO prefetches mentioned in footnote 1, since there is a window in which the line can be "stolen back" by another core, making the RFO prefetch wasted work. A sophisticated implementation may have a predictor that varies the RFO prefetch aggressiveness based on monitoring whether this occurs.
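As a rough illustration of what such a predictor could look like, the sketch below uses a simple saturating counter. The class `RfoPrefetchThrottle`, its `max_depth` parameter, and the update rule are all hypothetical; nothing here is claimed to match how any real CPU tunes RFO prefetching.

```python
class RfoPrefetchThrottle:
    """Hypothetical saturating counter that tunes RFO prefetch depth."""

    def __init__(self, max_depth=4):
        self.max_depth = max_depth
        self.depth = max_depth   # how many upcoming stores to prefetch for

    def record(self, stolen_back):
        # Back off when a prefetched line was stolen before the store
        # committed; ramp back up when a prefetch turned out to be useful.
        if stolen_back:
            self.depth = max(0, self.depth - 1)
        else:
            self.depth = min(self.max_depth, self.depth + 1)

    def lines_to_prefetch(self, store_buffer_lines):
        # The most senior store (index 0) gets a normal RFO at commit time,
        # so only the stores queued behind it are prefetch candidates.
        return store_buffer_lines[1:1 + self.depth]


throttle = RfoPrefetchThrottle(max_depth=2)
queued = ["a", "b", "c", "d"]

initial = throttle.lines_to_prefetch(queued)       # full aggressiveness
throttle.record(stolen_back=True)
after_one = throttle.lines_to_prefetch(queued)     # backed off one step
throttle.record(stolen_back=True)
after_two = throttle.lines_to_prefetch(queued)     # prefetching disabled
throttle.record(stolen_back=False)
recovered = throttle.lines_to_prefetch(queued)     # ramping back up
```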

