Size of store buffers on Intel hardware? What exactly is a store buffer?

Problem Description


The Intel optimization manual talks about the number of store buffers that exist in many parts of the processor, but does not seem to talk about the size of the store buffers. Is this public information, or is the size of a store buffer kept as a microarchitectural detail?

The processors I am looking into are primarily Broadwell and Skylake, but information about others would be nice as well.

Also, what do store buffers do, exactly?

Solution

Related: what is a store buffer?

Also How do the store buffer and Line Fill Buffer interact with each other? has a good description of the steps in executing a store instruction and how it eventually commits to L1d cache.


The store buffer as a whole is composed of multiple entries.

Each core has its own store buffer to decouple execution and retirement from commit into L1d cache. Even an in-order CPU benefits from a store buffer to avoid stalling on cache-miss stores, because unlike loads they just have to become visible eventually. (No practical CPUs use a sequential-consistency memory model, so at least StoreLoad reordering is allowed, even in x86 and SPARC-TSO).
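
To make that StoreLoad reordering concrete, here is a minimal C++ litmus-test sketch (the names x, y, r1, r2 and the iteration count are mine, not from any reference implementation). Even on x86, both loads can occasionally see 0, because each thread's store is still sitting in its own core's store buffer when its load executes:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() {
    x.store(1, std::memory_order_relaxed);  // enters this core's store buffer
    r1 = y.load(std::memory_order_relaxed); // can execute before the store commits
}
void t2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    for (int i = 0; i < 100000; ++i) {
        x = 0; y = 0;
        std::thread a(t1), b(t2);
        a.join(); b.join();
        if (r1 == 0 && r2 == 0)  // StoreLoad reordering observed
            std::printf("reordered on iteration %d\n", i);
    }
}
```

Thread startup overhead makes the reordered outcome rare with this naive harness; a tighter loop with pre-spawned, spin-synchronized threads observes it far more often.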

For speculative / out-of-order CPUs, it also makes it possible to roll back a store after detecting an exception or other mis-speculation in an older instruction, without speculative stores ever being globally visible. This is obviously essential for correctness! (You can't roll back other cores, so you can't let them see your store data until it's known to be non-speculative.)


When both logical cores are active (hyperthreading), Intel partitions the store buffer in two; each logical core gets half. Loads from one logical core only snoop its own half of the store buffer. What will be used for data exchange between threads are executing on one Core with HT?

The store buffer commits data from retired store instructions into L1d as fast as it can, in program order (to respect x86's strongly-ordered memory model). Requiring stores to commit as they retire would unnecessarily stall retirement for cache-miss stores. Retired stores still in the store buffer are definitely going to happen and can't be rolled back, so they can actually hurt interrupt latency. (Interrupts aren't technically required to be serializing, but any stores done by an IRQ handler can't become visible until after existing pending stores are drained. And iret is serializing, so even in the best case the store buffer drains before returning.)

It's a common(?) misconception that the store buffer has to be explicitly flushed for data to become visible to other threads. Memory barriers don't cause the store buffer to be flushed; full barriers make the current core wait until the store buffer drains itself before allowing any later loads to happen (i.e. read L1d). Atomic RMW operations have to wait for the store buffer to drain before they can lock a cache line and do both their load and store to that line without allowing it to leave MESI Modified state, thus stopping any other agent in the system from observing it during the atomic operation.
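
As a sketch of that distinction: adding a seq_cst fence (which compilers typically implement on x86 as mfence or a dummy locked RMW) between each store and load in the litmus test above forbids the r1 == r2 == 0 outcome. Nothing is "flushed early"; the core simply waits for its own store buffer to drain before the later load is allowed to read L1d:

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1_fenced() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier: wait for SB drain
    r1 = y.load(std::memory_order_relaxed);              // now can't pass the store
}
void t2_fenced() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}
```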

To implement x86's strongly ordered memory model while still microarchitecturally allowing early / out-of-order loads (and later checking if the data is still valid when the load is architecturally allowed to happen), load buffer + store buffer entries collectively form the Memory Order Buffer (MOB). (If a cache line isn't still present when the load was allowed to happen, that's a memory-order mis-speculation.) This structure is presumably where mfence and locked instructions can put a barrier that blocks StoreLoad reordering without blocking out-of-order execution. (Although mfence on Skylake does block OoO exec of independent ALU instructions, as an implementation detail.)

movnt cache-bypassing stores (like movntps) also go through the store buffer, so they can be treated as speculative just like everything else in an OoO exec CPU. But they commit directly to an LFB (Line Fill Buffer), aka write-combining buffer, instead of to L1d cache.
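
For reference, a minimal sketch of using such NT stores from C++ (x86 with SSE assumed; the function name and the alignment/size preconditions are mine):

```cpp
#include <immintrin.h>
#include <cstddef>

// Fill an array with movntps NT stores. dst must be 16-byte aligned
// and n a multiple of 4; the stores go through the store buffer but
// commit to an LFB / write-combining buffer instead of L1d.
void fill_nt(float* dst, std::size_t n, float v) {
    __m128 val = _mm_set1_ps(v);
    for (std::size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, val);
    _mm_sfence(); // make the weakly-ordered NT stores visible before later stores
}
```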


Store instructions on Intel CPUs decode to store-address and store-data uops (micro-fused into one fused-domain uop). The store-address uop just writes the address (and probably the store width) into the store buffer, so later loads can set up store->load forwarding or detect that they don't overlap. The store-data uop writes the data.
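
Store->load forwarding (and its limits) is one of the few store-buffer behaviors you can see directly from software. A hypothetical sketch (the function names are mine): reloading exactly what was stored forwards quickly from the SB entry, while a wide reload that overlaps two narrower buffered stores typically takes a store-forwarding stall on Intel CPUs:

```cpp
#include <cstdint>
#include <cstring>

uint32_t reload_same_width(uint32_t v) {
    uint32_t buf;
    std::memcpy(&buf, &v, 4);   // one 4-byte store: one SB entry
    uint32_t out;
    std::memcpy(&out, &buf, 4); // reload matches: fast store->load forwarding
    return out;
}

uint32_t reload_overlapping(uint16_t lo, uint16_t hi) {
    uint16_t halves[2] = {lo, hi}; // two 2-byte stores, two SB entries
    uint32_t out;
    std::memcpy(&out, halves, 4);  // 4-byte reload spans both: forwarding stall
    return out;
}
```

(An optimizing compiler may fold these copies away; compile without optimization or check the asm if you want to benchmark the effect.)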

Store-address and store-data uops can execute in either order, whichever is ready first: the allocate/rename stage that writes uops from the front-end into the ROB and RS in the back end also allocates a load-buffer or store-buffer entry for load or store uops at issue time, or stalls until one is available. Since allocation and commit happen in order, that probably means older/younger is easy to keep track of, because the store buffer can just be a circular buffer that doesn't have to worry about old long-lived entries still being in use after wrapping around. (Unless cache-bypassing / weakly-ordered NT stores can do that? They can commit to an LFB (Line Fill Buffer) out of order. Unlike normal stores, they commit directly to an LFB for transfer off-core, rather than to L1d.)
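
A toy software model of that circular allocation (entirely my own illustration, not a description of Intel's hardware): with in-order allocation at the head and in-order commit at the tail, "which entry is older" is just sequence-number arithmetic, and no scan for free slots is ever needed:

```cpp
// Toy circular store-buffer allocator. Sequence numbers grow
// monotonically; the entry index is seq % N.
struct ToyStoreBuffer {
    static constexpr unsigned N = 56;  // e.g. Skylake's entry count
    unsigned head = 0, tail = 0;       // next alloc / oldest live entry

    bool alloc(unsigned& idx) {        // at issue/rename time
        if (head - tail == N) return false; // full: allocation stalls
        idx = head++ % N;
        return true;
    }
    void commit_oldest() { ++tail; }   // in-order commit to L1d frees the tail
    static bool is_older(unsigned a, unsigned b) {
        return (int)(a - b) < 0;       // wrap-safe age comparison
    }
};
```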


Store buffer sizes are measured in entries, not bits.

Narrow stores don't "use less space" in the store buffer; they still use exactly 1 entry.

Skylake's store buffer has 56 entries (wikichip), up from 42 in Haswell/Broadwell, and 36 in SnB/IvB (David Kanter's HSW writeup on RealWorldTech has diagrams). You can find numbers for most earlier x86 uarches in Kanter's writeups on RWT, or Wikichip's diagrams, or various other sources.

SKL/BDW/HSW also have 72 load buffer entries, SnB/IvB have 64. This is the number of in-flight load instructions that either haven't executed or are waiting for data to arrive from outer caches.


The size in bits of each entry is an implementation detail that has zero impact on how you optimize software. Similarly, we don't know the size in bits of a uop (in the front-end, in the ROB, in the RS), or TLB implementation details, or many other things, but we do know how many ROB and RS entries there are, and how many TLB entries of different types there are in various uarches.

Intel doesn't publish circuit diagrams for their CPU designs and (AFAIK) these sizes aren't generally known, so we can't even satisfy our curiosity about design details / tradeoffs.


Write coalescing in the store buffer:

Back-to-back narrow stores to the same cache line can (probably?) be combined aka coalesced in the store buffer before they commit, so it might only take one cycle on a write port of L1d cache to commit multiple stores.

We know for sure that some non-x86 CPUs do this, and we have some evidence / reason to suspect that Intel CPUs might do this. But if it happens, it's limited. @BeeOnRope and I currently think Intel CPUs probably don't do any significant merging. And if they do, the most plausible case is that entries at the end of the store buffer (ready to commit to L1d) that all go to the same cache line might merge into one buffer, optimizing commit if we're waiting for an RFO for that cache line. See discussion in comments on Are two store buffer entries needed for split line/page stores on recent Intel?. I proposed some possible experiments but haven't done them.
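
For the record, here is the shape of one such experiment (a rough sketch of my own, not a finished benchmark): compare byte stores that stay within one cache line for long runs against byte stores that touch a new line every time. If the same-line case coalesced before commit, it could be noticeably less limited by L1d commit bandwidth. Wall-clock timing like this is crude; performance counters for L1d writes would be more conclusive:

```cpp
#include <chrono>
#include <cstdio>

alignas(64) static volatile char buf[64 * 64]; // 64 cache lines

template <int STRIDE> // 1: 64 consecutive stores per line; 64: new line each store
double time_stores(long long iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (long long i = 0; i < iters; ++i)
        buf[(i * STRIDE) % (64 * 64)] = (char)i;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const long long iters = 100000000;
    std::printf("same-line runs: %.3f s\n", time_stores<1>(iters));
    std::printf("line per store: %.3f s\n", time_stores<64>(iters));
}
```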

Earlier stuff about possible store-buffer merging:

See discussion starting with this comment: Are write-combining buffers used for normal writes to WB memory regions on Intel?

And also Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake may be relevant.

We know for sure that some weakly-ordered ISAs like Alpha 21264 did store coalescing in their store buffer, because the manual documents it, along with its limitations on what it can commit and/or read to/from L1d per cycle. Also PowerPC RS64-II and RS64-III, with less detail, in docs linked from a comment here: Are there any modern CPUs where a cached byte store is actually slower than a word store?

People have published papers on how to do (more aggressive?) store coalescing in TSO memory models (like x86), e.g. Non-Speculative Store Coalescing in Total Store Order

Coalescing could allow a store-buffer entry to be freed before its data commits to L1d (presumably only after retirement), if its data is copied to a store to the same line. This could only happen if no stores to other lines separate them, or else it would cause stores to commit (become globally visible) out of program order, violating the memory model. But we think this can happen for any 2 stores to the same line, even the first and last byte.

A problem with this idea is that SB entry allocation is probably a ring buffer, like the ROB. Releasing entries out of order would mean hardware would need to scan every entry to find a free one, and then if they're reallocated out of order then they're not in program order for later stores. That could make allocation and store-forwarding much harder so it's probably not plausible.

As discussed in Are two store buffer entries needed for split line/page stores on recent Intel?, it would make sense for an SB entry to hold all of one store even if it spans a cache-line boundary. Cache-line boundaries become relevant when committing to L1d cache on leaving the SB. We know that store-forwarding can work for stores that split across a cache line. That seems unlikely if they were split into multiple SB entries in the store ports.


Terminology: I've been using "coalescing" to talk about merging in the store buffer, vs. "write combining" to talk about NT stores that combine in an LFB before (hopefully) doing a full-line write with no RFO. Or stores to WC memory regions which do the same thing.

This distinction / convention is just something I made up. According to discussion in comments, this might not be standard computer architecture terminology.

Intel's manuals (especially the optimization manual) are written over many years by different authors, and aren't consistent in their terminology. Take most parts of the optimization manual with a grain of salt, especially if they talk about Pentium 4. The newer sections about Sandybridge and Haswell are reliable, but older parts might have stale advice that's only / mostly relevant for P4 (e.g. inc vs. add 1), or the microarchitectural explanations for some optimization rules might be confusing / wrong. Especially section 3.6.10 Write Combining. The first bullet point, about using LFBs to combine stores while waiting for lines to arrive for cache-miss stores to WB memory, just doesn't seem plausible because of memory-ordering rules. See discussion between me and BeeOnRope linked above, and in comments here.


Footnote 1:

A write-combining cache to buffer write-back (or write-through) from inner caches would have a different name. e.g. Bulldozer-family uses 16k write-through L1d caches, with a small 4k write-back buffer. (See Why do L1 and L2 Cache waste space saving the same data? for details and links to even more details. See Cache size estimation on your system? for a rewrite-an-array microbenchmark that slows down beyond 4k on a Bulldozer-family CPU.)

Footnote 2: Some POWER CPUs let other SMT threads snoop retired stores in the store buffer: this can cause different threads to disagree about the global order of stores from other threads. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?

Footnote 3: non-x86 CPUs with weak memory models can commit retired stores in any order, allowing more aggressive coalescing of multiple stores to the same line, and making a cache-miss store not stall commit of other stores.
