
Problem Description

I was working on a computer architecture question in which it was mentioned that the cache is a split cache, with no hazards. What exactly does this mean?

Solution

Introduction

A split cache is a cache that consists of two physically separate parts, where one part, called the instruction cache, is dedicated to holding instructions and the other, called the data cache, is dedicated to holding data (i.e., the memory operands of instructions). Both the instruction cache and the data cache are logically considered to be a single cache, described as a split cache, because both are hardware-managed caches for the same physical address space at the same level of the memory hierarchy. Instruction fetch requests are handled only by the instruction cache and memory operand read and write requests are handled only by the data cache. A cache that is not split is called a unified cache.
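As a minimal illustration of this routing rule, here is a toy sketch (hypothetical Python; the class and function names are invented for illustration): instruction fetches are steered to one structure and memory-operand accesses to the other, while both are indexed by the same physical address space.

```python
# Toy model of how requests are routed in a split L1 (illustrative only).

class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = set()              # addresses of cached lines

    def access(self, line_addr):
        hit = line_addr in self.lines
        if not hit:
            self.lines.add(line_addr)   # fill on miss (no eviction modeled)
        return hit

class SplitL1:
    LINE_SIZE = 64

    def __init__(self):
        self.icache = Cache("L1I")      # holds instructions only
        self.dcache = Cache("L1D")      # holds memory operands only

    def fetch_instruction(self, paddr):
        return self.icache.access(paddr // self.LINE_SIZE)

    def load_or_store(self, paddr):
        return self.dcache.access(paddr // self.LINE_SIZE)

# A unified cache would send both kinds of requests to the same structure.
l1 = SplitL1()
print(l1.fetch_instruction(0x1000))   # handled only by the instruction cache
print(l1.load_or_store(0x2000))       # handled only by the data cache
```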

The Harvard vs. von Neumann architecture distinction originally applies to main memory. However, most modern computer systems implement the modified Harvard architecture whereby the L1 cache implements the Harvard architecture and the rest of the memory hierarchy implements the von Neumann architecture. Therefore, in modern systems, the Harvard vs. von Neumann distinction mostly applies to the L1 cache design. That's why the split cache design is also called the Harvard cache design and the unified cache design is also called von Neumann. The Wikipedia article on the modified Harvard architecture discusses three variants of the architecture, of which one is the split cache design.

To my knowledge, the idea of the split cache design was first proposed and evaluated by James Bell, David Casasent, and C. Gordon Bell in their paper entitled An Investigation of Alternative Cache Organizations, which was published in 1974 in the IEEE TC journal (the IEEE version is a bit clearer). The authors found using a simulator that, for almost all cache capacities considered in the study, an equal split results in the best performance (see Figure 5). From the paper:

They also provided a comparison with a unified cache design of the same capacity and their initial conclusion was that the split design has no advantage over the unified design.

It's not clear to me actually whether the paper evaluated the split design or a cache that is partitioned between instructions and data. One paragraph says:


It seems to me that the authors are talking about both the split and partitioned designs. But it's not clear what design was implemented in the simulator and how the simulator was configured for evaluation.

Note that the paper didn't discuss why the split design may have a better or worse performance than the unified design. Also note how the authors used the terms "dedicated cache" and "homogeneous cache." The terms "split" and "unified" appeared in later works; I believe they were first used by Alan Jay Smith in Directions for memory hierarchies and their components: research and development in 1978. But I'm not sure, because the way Alan used these terms gives the impression that they were already well known. It appears to me from Alan's paper that the first processor that used the split cache design was the IBM 801 around 1975, and the second was probably the S-1 (around 1976). The engineers of these processors might have come up with the split design idea independently.

Advantages of the Split Cache Design

The split cache design was then extensively studied in the next two decades. See, for example, Section 2.8 of this highly influential paper. But it was quickly recognized that the split design is useful for pipelined processors where the instruction fetch unit and the memory access unit are physically located in different parts of the chip. With the unified design, it is impossible to place the cache simultaneously close to the instruction fetch unit and the memory unit, resulting in high cache access latency from one or both units. The split design enables us to place the instruction cache close to the instruction fetch unit and the data cache close to the memory unit, thereby simultaneously reducing the latencies of both. (See what this looks like in the S-1 processor in Figure 3 of this document.) This is the primary advantage of the split design over the unified design. This is also the crucial difference between the split design and a unified design that supports cache partitioning. That's why it makes sense to have a split data cache, as proposed in several research works, such as Cache resident data locality analysis and Partitioned first-level cache design for clustered microarchitectures.

Another advantage of the split design is that it allows instruction and data accesses to occur in parallel without contention. Essentially, a split cache can have double the bandwidth of a unified cache. This improves performance in pipelined processors because instruction and data accesses can occur in the same cycle in different stages of the pipeline. Alternatively, the bandwidth of a unified cache can be doubled or improved using multiple access ports or multiple banks. In fact, using two ports provides twice the bandwidth to the whole cache (in contrast, in the split design, the bandwidth is split in half between the instruction cache and the data cache), but adding another port is more expensive in terms of area and power and may impact latency. A third alternative to improve the bandwidth is to add more wires to the same port so that more bits can be accessed in the same cycle, but this would probably be restricted to the same cache line (in contrast to the two other approaches). If the cache is off-chip, then the wires that connect it to the pipeline become pins and the impact of the number of wires on area, power, and latency becomes more significant.
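To put rough, purely hypothetical numbers on this: if each port can deliver one 64-bit word per clock at 1 GHz, a single-ported unified L1 tops out at about 8 GB/s shared between instruction fetches and data accesses, while an instruction cache and a data cache with one port each can each sustain up to 8 GB/s, for 16 GB/s in total, without any arbitration between the two streams.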

In addition, processors that use a unified (L1) cache typically include arbitration logic that prioritizes data accesses over instruction accesses; this logic can be eliminated in the split design. (See the discussion of the Z80000 processor below for a unified design that avoids arbitration.) Similarly, if there is another cache level that implements the unified design, there will be a need for arbitration logic at the L2 unified cache. Simple arbitration policies may reduce performance and better policies may increase area. [TODO: Add examples of policies.]
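As a hypothetical sketch of what such arbitration might look like for a single-ported unified L1 (the priority rule follows the text above; everything else is invented for illustration):

```python
# Toy per-cycle arbiter for a single-ported unified L1.
# With a split L1, both requests could proceed in the same cycle
# and this logic disappears.

def arbitrate(ifetch_request, data_request):
    """Return (granted, stalled) for one cycle.

    Policy from the text: data accesses win over instruction fetches;
    the loser is replayed in a later cycle.
    """
    if data_request is not None:
        return data_request, ifetch_request     # the fetch unit stalls
    return ifetch_request, None                 # no contention this cycle

granted, stalled = arbitrate(ifetch_request="fetch 0x1000",
                             data_request="load 0x2000")
print(granted, "| stalled:", stalled)           # the load wins, the fetch stalls
```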

Another potential advantage is that the split design allows us to employ different replacement policies for the instruction cache and data cache that may be more suitable for the access patterns of each cache. All Intel Itanium processors use the LRU policy for the L1I and the NRU policy for the L1D (I know for sure that this applies to the Itanium 2 and later, but I'm not sure about the first Itanium). Moreover, starting with the Itanium 9500, the L1 ITLB uses NRU but the L1 DTLB uses LRU. Intel didn't disclose why they decided to use different replacement policies in these processors. In general, it seems to me that it's uncommon for the L1I and L1D to use different replacement policies. I couldn't find a single research paper on this (all papers on replacement policies focus only on data or unified caches). Even for a unified cache, it may be useful for the replacement policy to distinguish between instruction and data lines. In a split design, a cache line fetched into the data cache can never displace a line in the instruction cache. Similarly, a line filled into the instruction cache can never displace a line in the data cache. This issue, however, may occur in the unified design.
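For reference, here is a minimal sketch of victim selection under the two policy families mentioned above; these are textbook LRU and NRU, not Intel's actual implementations.

```python
# Textbook LRU and NRU victim selection for a single cache set
# (illustrative only; real hardware uses approximations).

def lru_victim(ways, last_used):
    """ways: list of tags; last_used[tag] = time of the last access."""
    return min(ways, key=lambda tag: last_used[tag])

def nru_victim(ways, ref_bit):
    """ways: list of tags; ref_bit[tag] = 1 if recently referenced.

    Evict a line whose reference bit is 0; if every line was recently
    referenced, clear all the bits and evict the first line.
    """
    for tag in ways:
        if ref_bit[tag] == 0:
            return tag
    for tag in ways:
        ref_bit[tag] = 0
    return ways[0]

ways = ["A", "B", "C", "D"]
print(lru_victim(ways, {"A": 5, "B": 1, "C": 9, "D": 3}))  # -> B (oldest)
print(nru_victim(ways, {"A": 1, "B": 1, "C": 0, "D": 1}))  # -> C (not recently used)
```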

The last sub-section of the Wikipedia article's section on the differences between the modified Harvard architecture and the Harvard and von Neumann architectures mentions that the Harvard Mark I machine used different memory technologies for the instruction and data memories. This made me wonder whether this can constitute an advantage for the split design in modern computer systems. Here are some of the papers that show that this is indeed the case:

  • LASIC: Loop-Aware Sleepy Instruction Caches Based on STT-RAM Technology: The instruction cache is mostly read-only, except when there is a miss, in which case the line must be fetched and filled into the cache. This means that, when using STT-RAM (or really any other NVRAM technology), the expensive write operations occur less frequently compared to using STT-RAM for the data cache. The paper shows that by using an SRAM loop cache (like the LSD in Intel processors) and an STT-RAM instruction cache, energy consumption can be significantly reduced, especially when a loop is being executed that fits entirely in the loop cache. The non-volatile property of STT-RAM enables the authors to completely power-gate the instruction cache without losing its contents. In contrast, with an SRAM instruction cache, the static energy consumption is much larger, and power-gating it results in losing its contents. There is, however, a performance penalty with the proposed design (compared to a pure SRAM cache hierarchy).
  • Feasibility exploration of NVM based I-cache through MSHR enhancements: This paper also proposes using STT-RAM for the instruction cache while the data cache and the L2 cache remain based on SRAM. There is no loop cache here. This paper instead targets the high write latency issue of STT-RAM, which is incurred when a line is filled in the cache. The idea is that when a requested line is received from the L2 cache, the L1 cache first buffers the line in the MSHR allocated for its request. The MSHRs are still SRAM-based. Then the instruction cache line can be fed into the pipeline directly from the MSHR without having to potentially stall until it gets written in the STT-RAM cache. Similar to the previous work, the proposed architecture improves energy consumption at the expense of reduced performance.
  • System level exploration of a STT-MRAM based level 1 data-cache: Proposes using STT-RAM for the L1 data cache while keeping all other caches SRAM-based. This reduces area overhead and energy consumption, but performance is penalized.
  • Loop optimization in presence of STT-MRAM caches: A study of performance-energy tradeoffs: Compares the energy consumption and performance of pure (only SRAM or only STT-RAM) and hybrid (the L2 and instruction cache are STT-RAM-based) hierarchies. The hybrid cache hierarchy offers a performance-energy tradeoff that is in between the pure SRAM and pure STT-RAM hierarchies.

So I think we can say that one advantage of the split design is that we can use different memory technologies for the instruction and data caches.

There are two other advantages, which will be discussed later in this answer.

Disadvantages of the Split Cache Design

The split design has its problems, though. First, the combined space of the instruction and data caches may not be efficiently utilized. A cache line that contains both instructions and data may exist in both caches at the same time. In contrast, in a unified cache, only a single copy of the line would exist in the cache. In addition, the size of the instruction cache and/or the data cache may not be optimal for all applications or different phases of the same application. Simulations have shown that a unified cache of the same total size has a higher hit rate (see the VSC paper discussed later). This is the primary disadvantage of the split design. (If there is a placement contention on a single cache set in the split design, this contention may still occur in the unified design and it may have a worse impact on performance. In such a scenario, the split design would have a lower overall miss rate.)
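For example (with made-up numbers, just to illustrate the utilization argument): suppose a program phase touches roughly 24 KB of data but only 4 KB of instructions. With a 16 KB + 16 KB split design, around 12 KB of the instruction cache sits idle while the data cache thrashes; a 32 KB unified cache of the same total size could devote most of its capacity to the data working set, which is why its hit rate tends to be higher.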

Second, self-modifying code leads to consistency issues that need to be considered at the microarchitecture-level and/or software-level. (An inconsistency may be allowed between the two caches for a small number of cycles, but if the ISA does not allow such inconsistencies to be observable, they have to be detected before the instruction that got modified permanently changes the architectural state.) Maintaining instruction consistency requires more logic and has a higher performance impact in the split design than the unified one.
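One way a microarchitecture could handle this, sketched very roughly below, is to have committed stores snoop the instruction cache and invalidate any matching line (and flush instructions that may have been fetched from the stale copy). This is only an illustration of the idea; real implementations differ.

```python
# Toy model of keeping a split L1 consistent under self-modifying code:
# every committed store snoops the instruction cache.

class SelfModifyingCodeGuard:
    def __init__(self):
        self.icache_lines = set()      # line addresses currently in the L1I
        self.pipeline_flushes = 0

    def on_instruction_fill(self, line_addr):
        self.icache_lines.add(line_addr)

    def on_store_commit(self, line_addr):
        # A store that hits a line cached as code invalidates it and
        # flushes younger instructions that may be stale.
        if line_addr in self.icache_lines:
            self.icache_lines.discard(line_addr)
            self.pipeline_flushes += 1

guard = SelfModifyingCodeGuard()
guard.on_instruction_fill(0x40)
guard.on_store_commit(0x40)        # store into a line that holds code
print(guard.pipeline_flushes)      # -> 1
```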

Third, the design and hardware complexity of a split cache compared against a single-ported unified cache, a fully dual-ported unified cache, and a dual-ported banked cache of the same overall organization parameters is an important consideration. According to the cache area model proposed in CACTI 3.0: An Integrated Cache Timing, Power, and Area Model, the fully dual-ported design has the biggest area. This holds true irrespective of the types of the two ports (exclusive-read, exclusive-write, read/write). The dual-ported banked cache has a higher area than the single-ported unified cache. How these two compare against the split design is less obvious to me. My understanding is that the split design has a higher area than the single-ported unified design [TODO: Explain why]. It may be important to consider the cache organization details, the lengths of the cache buses to the pipeline, and the process technology. One thing to note here is that a single-ported instruction cache has a lower area than a single-ported data cache or unified cache because the instruction cache requires only an exclusive-read port while the others require a read/write port.

Unified L1 and Split L2 Caches in Real Processors

I'm not aware of any processor designed in the last 15 years that has a unified (L1) cache. In modern processors, the unified design is mostly used for higher-numbered cache levels, which makes sense because they are not directly connected to the pipeline. An interesting example where the L2 cache follows the split design is the Intel Itanium 2 9000 processor. This processor has a 3-level cache hierarchy where both the L1 and L2 caches are split and private to each core and the L3 cache is unified and shared between all the cores. The L2D and L2I caches are 256 KB and 1 MB in size, respectively. Later Itanium processors reduced the L2I size to 512 KB. The Itanium 2 9000 manual explains why the L2 was made split:

(I think "against data accesses" was written twice by mistake.)

The second paragraph from that quote mentions an advantage that I have missed earlier. A split L2 cache moves the data-instruction conflict point from the L2 to the L3. In addition, some/many requests that miss in the L1 caches may hit in the L2, thereby making contention at the L3 less likely.

By the way, the L2I and L2D in the Itanium 2 9000 both use the NRU replacement policy.

Unified L1 Cache Partitioning

James Bell et al. mentioned in their 1974 paper the idea of partitioning a unified cache between instructions and data. The only paper that I'm aware of that proposed and evaluated such a design is Virtually Split Cache: An Efficient Mechanism to Distribute Instructions and Data, which was published in 2013. The main disadvantage of the split design is that one of the L1 caches may be underutilized while the other may be over-utilized. A split cache doesn't allow one cache to essentially take space from the other when needed. This is the reason why the unified design has a lower L1 miss rate than the overall miss rate of the split caches (as the paper shows using simulation). However, the combined impact on performance of the higher latency and lower miss rate still makes the system with the unified L1 cache slower than the one with the split cache.

The Virtually Split Cache (VSC) design is the middle point between the split and unified designs. The VSC dynamically partitions (way-wise) the L1 cache between instructions and data depending on demand. This enables better utilization of the L1 cache, similar to the unified design. However, the VSC has an even lower miss rate because partitioning reduces potential space conflicts between lines holding instructions and lines holding data. According to the experimental results (all cache designs have the same overall capacity), even if the VSC has the same latency as the unified cache, the VSC has about the same performance as the split design on a single-core system and higher performance on a multi-core system because the lower miss rate results in less contention on accessing the shared L2 cache. In addition, in both the single-core and multi-core system configurations, the VSC reduces energy consumption due to the lower miss rate.
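A rough sketch of the way-wise partitioning idea, as I understand it from the paper's description (the repartitioning heuristic below is invented purely for illustration):

```python
# Toy way-wise partitioning of an 8-way set between instructions and data,
# in the spirit of the VSC design (the adjustment rule is made up).

class VirtuallySplitSet:
    def __init__(self, num_ways=8, inst_ways=4):
        self.num_ways = num_ways
        self.inst_ways = inst_ways           # ways 0..inst_ways-1 hold code
        self.inst_misses = 0
        self.data_misses = 0

    def ways_for(self, is_instruction):
        return (range(0, self.inst_ways) if is_instruction
                else range(self.inst_ways, self.num_ways))

    def record_miss(self, is_instruction):
        if is_instruction:
            self.inst_misses += 1
        else:
            self.data_misses += 1

    def maybe_repartition(self):
        # Periodically give one more way to whichever side misses more,
        # always keeping at least one way for each.
        if self.data_misses > 2 * self.inst_misses and self.inst_ways > 1:
            self.inst_ways -= 1
        elif self.inst_misses > 2 * self.data_misses and self.inst_ways < self.num_ways - 1:
            self.inst_ways += 1
        self.inst_misses = self.data_misses = 0

s = VirtuallySplitSet()
for _ in range(10):
    s.record_miss(is_instruction=False)      # the data side is under pressure
s.maybe_repartition()
print(s.inst_ways)                           # -> 3: the data side gained a way
```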

A VSC could have a lower latency than a unified cache. Although both are dual-ported (to have the same bandwidth as the single-ported split cache), in the VSC design only the interface needs to be dual-ported because no part of the cache can be accessed more than once at the same time. (The paper doesn't explicitly say so, but I think the VSC allows the same line to be present in both partitions if it holds both instructions and data, so it still has the consistency problem that exists in the split design.) Assuming that each bank of the cache represents one cache way, each bank can then be single-ported in the VSC. This leads to a simpler design (see: Fast quadratic increase of multiport-storage-cell area with port number) and may allow reducing the latency. Moreover, assuming that the difference in latency between the unified design and the split design is small (because the instruction cache and data cache in the split design are physically close to each other), the VSC design can store instructions and data in banks that are physically close to where they are needed in the pipeline and support variable-latency access depending on how many banks are allocated for each. The larger the number of banks, the higher the latency, up to the latency of the unified design. This would require, however, a pipeline design that can handle such a variable-latency cache.

I think one important thing that this paper is missing is evaluating the VSC design with higher access latencies with respect to the split design (not just 2 cycles vs. 3 cycles). I think increasing the latency by even only one cycle would make VSC slower than split.

The Case of the Z80000 Unified Cache

The Zilog Z80000 processor has a scalar 6-stage pipeline with an on-chip single-ported unified cache. The cache is 16-way fully associative and sectored. Each stage of the pipeline takes at least two clock cycles (loads that miss in the cache and other complex instructions may take more cycles). Each pair of consecutive clock cycles constitutes a processor cycle. The cache design of the Z80000 has a number of unique properties that I've not seen anywhere else:

  • There can be up to two cache accesses in a single processor cycle, including up to one instruction fetch and up to one data access. However, the cache, despite being unified and single-ported, is designed in such a way that there is no contention between instruction fetches and data accesses. The unified cache has an access latency of a single clock cycle (which is equal to half a processor cycle). In each processor cycle, an instruction fetch is performed in the first clock cycle and a data access is performed in the second clock cycle. There is no latency benefit from splitting the cache in this case, time-multiplexing accesses to the cache provides the same bandwidth, and the downsides of the split design don't exist. The full associativity minimizes space contention between instruction and data lines. This design was made possible by the small cache size and the relatively shallow pipeline with respect to the cache latency.
  • The System Configuration Control Longword (SCCL) offers Cache Instruction (CI) and Cache Data (CD) control bits. If CI is 1, instruction fetches that miss in the cache can be filled into the cache. If CD is 1, data loads that miss in the cache can be filled into the cache. The cache uses a write-no-allocate policy, so write misses never allocate in the cache. If both CI and CD are set to 1, the cache effectively works like a unified cache. If only one of the flags is 1, the cache effectively works like a data-only or instruction-only cache. Applications can tune these flags to improve performance. (A sketch of these fill rules follows this list.)
  • This property is not relevant to the question, but I found it interesting. The SCCL also offers a Cache Replacement (CR) control bit. Setting this bit to zero disables replacement on a miss, so lines are never replaced. If all entries in a set are occupied and a load/fetch miss occurs in that set, the line is simply not filled into the cache.
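Here is a hypothetical sketch of how the CI, CD, and CR bits described above might combine with the write-no-allocate policy to decide whether a missing line is filled; the exact Z80000 behavior may differ in its details.

```python
# Toy model of the Z80000 fill decision controlled by the SCCL bits
# (CI, CD, CR), as described above; simplified for illustration.

def should_fill_on_miss(kind, ci, cd, cr, set_is_full):
    """kind is 'ifetch', 'load', or 'write'.
    Returns True if the missing line gets filled into the cache."""
    if kind == "write":
        return False            # write-no-allocate: never fill on a write miss
    if kind == "ifetch" and not ci:
        return False            # CI=0: instruction fills disabled
    if kind == "load" and not cd:
        return False            # CD=0: data fills disabled
    if set_is_full and not cr:
        return False            # CR=0: never replace an existing line
    return True

# CI=1, CD=0 makes the cache behave like an instruction-only cache.
print(should_fill_on_miss("load",   ci=1, cd=0, cr=1, set_is_full=False))  # False
print(should_fill_on_miss("ifetch", ci=1, cd=0, cr=1, set_is_full=False))  # True
```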

The Cases of the R3000, 80486, and Pentium P5

I came across the following complementary question on SE Retrocomputing: Why did Intel abandon unified CPU cache?. There are a number of issues with the accepted answer on that question. I'll address these issues here and explain why the 80486 and Pentium caches were designed like that based on information from Intel.

The 80386 does have an external cache controller with an external unified cache. However, just because the cache is external doesn't necessarily mean that it's likely to be unified. Consider the R3000 processor, which was released three years after the 80386 and is of the same generation as the 80486. The designers of the R3000 opted for a large external cache instead of a small on-chip cache to improve performance, according to Section 1.8 of PaceMips R3000 32-Bit, 25 MHz RISC CPU with Integrated Memory Management Unit. The first section of Chapter 1 of the R3000 Software Reference Manual says that the external cache uses the split design so that it can perform an instruction fetch and a read or write data access in the same "clock phase." It's not clear to me exactly how this works, though. My understanding is that the external data and address buses are shared between the two caches and with memory as well. (Also, some of the address wires are used to provide cache line tags to the on-chip cache controller for tag matching.) Both caches are direct-mapped, maybe to achieve a single-cycle access latency. A unified external cache design with the same bandwidth, associativity, and capacity would require the cache to be fully dual-ported, or the VSC design could be used, but VSC was invented many years later. Such a unified cache would be more expensive and might have a latency larger than the single cycle required to keep the pipeline filled with instructions.

Another issue with the linked answer from Retrocomputing is that just because the 80486 evolved directly from the 80386 doesn't necessarily mean that it had to also use the unified design. According to the Intel paper titled The i486 CPU: executing instructions in one clock cycle, Intel evaluated both designs and deliberately chose to go for the unified on-chip design. Compared to the same-generation R3000, both processors have similar frequency ranges and the off-chip data width is 32 bits in both processors. However, the unified cache of the 80486 is much smaller than the total cache capacity of the R3000 (up to 16 KB vs. up to 256 KB + 256 KB). On the other hand, being on-chip made it more feasible for the 80486 to have wider cache buses. In particular, the 80486 cache has a 16-byte instruction fetch bus, a 4-byte data load bus, and a 4-byte data load/store bus. The two data buses could be used at the same time to load a single 8-byte operand (a double-precision FP operand or a segment descriptor) in one access. The R3000 caches share a single 4-byte bus. The relatively small size of the 80486 cache may have allowed making it 4-way associative with a single-cycle latency. This means that a load instruction that hits in the cache can supply the data to a dependent instruction in the next cycle without any stalls. On the R3000, if an instruction depends on an immediately preceding load instruction, it has to stall for one cycle in the best-case scenario of a cache hit.

The 80486 cache is single-ported, but the instruction prefetch buffer and the wide 16-byte instruction fetch bus help keep contention between instruction fetches and data accesses to a minimum. Intel mentions that simulation results showed that the unified design provides a hit rate high enough above that of a split cache to compensate for the bandwidth contention.

Intel explained in another paper, titled Design of the Intel Pentium processor, why they decided to change the cache in the Pentium to a split design. There are two reasons: (1) the 2-wide superscalar Pentium requires the ability to perform up to two data accesses in a single cycle, and (2) branch prediction increases cache bandwidth demand. The paper doesn't mention whether Intel considered using a triple-ported banked unified cache, but they probably did and found it infeasible at that time, so they went for a split cache with a dual-ported 8-banked data cache and a single-ported instruction cache. With today's fab technology, the triple-ported unified design may be better.

Wider pipelines in later microarchitectures required higher parallelism at the data cache. Now we're at 4 64-byte ports in Sunny Cove.

Answering the Second Part of the Question

It's probably about the structural hazard mentioned in Paul's comment. That is, a unified single-ported cache cannot be accessed by the instruction fetch unit and the memory unit at the same time.
