Question
This question has an answer that says:
Why did CPU designers end up with duplication of state resources for simultaneous multithreading (or hyper-threading on Intel)?
Why wouldn't tripling (quadrupling, and so on) those same resources give us three logical cores and, therefore, even faster throughput?
Is the duplication that researchers arrived at in some sense optimal, or is it just a reflection of current possibilities (transistor size, etc.)?
Recommended answer
The answer you're quoting sounds wrong. Hyperthreading competitively shares the existing ALUs, cache, and physical register file.
Running two threads at once on the same core lets it find more parallelism to keep those execution units fed with work instead of sitting idle waiting for cache misses, latency, and branch mispredictions. (See Modern Microprocessors: A 90-Minute Guide! for very useful background, including a section on SMT. See also this answer for more about how modern superscalar / out-of-order CPUs find and exploit instruction-level parallelism to run more than 1 instruction per clock.)
Only a few things need to be physically replicated or partitioned to track the architectural state of two CPUs in one core, and it's mostly in the front-end (before the issue/rename stage). David Kanter's Haswell writeup shows how Sandybridge always partitioned the IDQ (the decoded-uop queue that feeds the issue/rename stage), but IvyBridge and Haswell can use it as one big queue when only a single thread is active. He also describes how cache is competitively shared between threads. For example, a Haswell core has 168 physical integer registers, but the architectural state of each logical CPU only needs 16. (Out-of-order execution for each thread of course benefits from lots of registers; that's why register renaming onto a big physical register file is done in the first place.)
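As a rough illustration of the register numbers above, the sketch below subtracts the architectural state of each logical CPU from the 168-entry physical file. The function name and the 4-thread case are hypothetical, and real hardware allocates physical registers dynamically, so treat this as back-of-the-envelope arithmetic only.

```python
# Figures quoted in the answer above (Haswell integer registers).
PHYSICAL_INT_REGS = 168
ARCH_INT_REGS_PER_LOGICAL_CPU = 16


def regs_left_for_renaming(logical_cpus: int) -> int:
    """Physical integer registers not tied up by committed architectural state.

    Simplifying assumption: each logical CPU permanently occupies exactly
    one physical register per architectural register.
    """
    if logical_cpus < 1:
        raise ValueError("need at least one logical CPU")
    return PHYSICAL_INT_REGS - logical_cpus * ARCH_INT_REGS_PER_LOGICAL_CPU


# 4 logical CPUs is a hypothetical variant, not a real Haswell mode.
for n in (1, 2, 4):
    print(n, regs_left_for_renaming(n))
```

Under this assumption, even two logical CPUs leave most of the file free for in-flight speculative results, which is where out-of-order execution gets its benefit.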
Some things are statically partitioned, like the ROB, to stop one thread from filling up the back-end with work dependent on a cache-miss load.
Modern Intel CPUs have so many execution units that you can only barely saturate them with carefully tuned code that doesn't have any stalls and runs 4 fused-domain uops per clock. This is very rare in practice, outside something like a matrix multiply in a hand-tuned BLAS library.
Most code benefits from HT because it can't saturate a full core on its own, so the existing resources of a single core can run two threads at faster than half speed each. (Usually significantly faster than half).
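A toy model of that claim, under loud assumptions: the core can issue at most `core_width` uops per clock (4, the fused-domain figure above), each thread alone would sustain some solo IPC, and issue bandwidth is the only thing the two threads compete for. Cache contention and partitioned structures are ignored, and the function name is made up for illustration.

```python
def smt_relative_speed(solo_ipc_a: float, solo_ipc_b: float,
                       core_width: float = 4.0) -> float:
    """Fraction of its solo speed each thread keeps when sharing one core.

    Toy model: if the combined demand fits within the core width, neither
    thread slows down; otherwise both are scaled back proportionally.
    """
    demand = solo_ipc_a + solo_ipc_b
    if demand <= 0:
        raise ValueError("solo IPCs must sum to a positive value")
    return min(1.0, core_width / demand)


# Stall-heavy code that leaves most issue slots empty when run alone.
print(smt_relative_speed(1.5, 1.5))
# Fairly well-tuned code: slower than solo, but still faster than half speed.
print(smt_relative_speed(3.0, 3.0))
# Code that already saturates the core alone: exactly half speed, no gain.
print(smt_relative_speed(4.0, 4.0))
```

In this model the combined throughput only stops improving once a single thread can fill the core by itself, which matches the rare hand-tuned case described above.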
But when only a single thread is running, the full power of a big core is available for that thread. This is what you lose out on if you design a multicore CPU that has lots of small cores. If Intel CPUs didn't implement hyperthreading, they would probably not include quite so many execution units for a single thread. It helps for a few single-thread workloads, but helps a lot more with HT. So you could argue that it is a case of replicating ALUs because the design supports HT, but it's not essential.
Pentium 4 didn't really have enough execution resources to run two full threads without losing more than you gained. Part of this might be the trace cache, but it also didn't have nearly as many execution units. P4 with HT made it useful to use prefetch threads that do nothing but prefetch data from an array the main thread is looping over, as described/recommended in What Every Programmer Should Know About Memory (which is otherwise still useful and relevant). A prefetch thread has a small trace-cache footprint and fetches into the L1D cache used by the main thread. This is what happens when you implement HT without enough execution resources to really make it good.
HT doesn't help at all for code that achieves very high throughput with a single thread per physical core. For example, saturating the front-end bandwidth of 4 uops / clock cycle without ever stalling.
Or if your code only bottlenecks on a core's peak FMA throughput or something (keeping 10 FMAs in flight with 10 vector accumulators). It can even hurt for code that ends up slowing down a lot from extra cache misses caused by competing for space in the L1D and L2 caches with another thread. (And also the uop cache and L1I cache).
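The figure of 10 accumulators follows from latency times throughput. The sketch below assumes Haswell-like numbers that are not stated in the text above: an FMA latency of 5 cycles and 2 FMAs started per clock. Both helper names are hypothetical.

```python
def accumulators_needed(latency_cycles: int, fma_per_clock: int) -> int:
    """Independent dependency chains needed to keep every FMA unit busy."""
    return latency_cycles * fma_per_clock


def sustained_fma_per_clock(accumulators: int, latency_cycles: int,
                            fma_per_clock: int) -> float:
    """FMA throughput when each accumulator forms one serial dependency chain."""
    return min(float(fma_per_clock), accumulators / latency_cycles)


# Assumed figures: 5-cycle FMA latency, 2 FMAs per clock.
print(accumulators_needed(5, 2))  # 10
# With a single accumulator, the loop is latency-bound on one chain.
print(sustained_fma_per_clock(1, 5, 2))
print(sustained_fma_per_clock(10, 5, 2))
```

A loop with fewer independent accumulators leaves FMA slots idle, and that kind of slack is what a second hardware thread could otherwise fill.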
Saturating the FMAs and doing something with the results typically takes some instructions other than vfma..., so high-throughput FP code is often close to saturating the front-end as well.
Agner Fog's microarch pdf says the same thing about very carefully tuned code not benefiting from HT, or even being hurt by it.
Paul Clayton's comments on the question also make some good points about SMT designs in general.
If you have different threads doing different things, SMT can still be helpful. For example, high-throughput FP code sharing a core with a thread that does mostly integer work and stalls a lot on branch mispredictions and cache misses could gain significant overall throughput. The low-throughput thread leaves most of the core unused most of the time, so running another thread that uses the other 80% of a core's front-end and back-end resources can be very good.