Problem description
It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register.
But I've never seen any examples. No x86 CPUs are like this, and I think all high-performance CPUs can directly modify any byte in a cache line, too. Are some microcontrollers or low-end CPUs different, if they have cache at all?
(I'm not counting word-addressable machines, or Alpha which is byte addressable but lacks byte load/store instructions. I'm talking about the narrowest store instruction the ISA natively supports.)
In my research while answering Can modern x86 hardware not store a single byte to memory?, I found that the reasons Alpha AXP omitted byte stores presumed they'd be implemented as true byte stores into cache, not an RMW update of the containing word. (So it would have made ECC protection for L1d cache more expensive, because it would need byte granularity instead of 32-bit).
I'm assuming that word-RMW during commit to L1d cache wasn't considered as an implementation option for other, more recent ISAs that do implement byte stores.
All modern architectures (other than early Alpha) can do true byte loads/stores to uncacheable MMIO regions (not RMW cycles), which is necessary for writing device drivers for devices that have adjacent byte I/O registers. (e.g. with external enable/disable signals to specify which parts of a wider bus hold the real data, like the 2-bit TSIZ (transfer size) on this ColdFire CPU/microcontroller, or like PCI / PCIe single byte transfers, or like DDR SDRAM control signals that mask selected bytes.)
Maybe doing an RMW cycle in cache for byte stores would be something to consider for a microcontroller design, even though it's not for a high-end superscalar pipelined design aimed at SMP servers / workstations like Alpha?
I think this claim might come from word-addressable machines. Or from unaligned 32-bit stores requiring multiple accesses on many CPUs, and people incorrectly generalizing from that to byte stores.
Just to be clear, I expect that a byte store loop to the same address would run at the same cycles per iteration as a word store loop. So for filling an array, 32-bit stores can go up to 4x faster than 8-bit stores. (Maybe less if 32-bit stores saturate memory bandwidth but 8-bit stores don't.) But unless byte stores have an extra penalty, you won't get more than a 4x speed difference. (Or whatever the word width is.)
And I'm talking about asm. A good compiler will auto-vectorize a byte or int store loop in C and use wider stores or whatever is optimal on the target ISA, if they're contiguous.
(And store coalescing in the store buffer could also result in wider commits to L1d cache for contiguous byte-store instructions, so that's another thing to watch out for when microbenchmarking)
; x86-64 NASM syntax
    mov   rdi, rsp
    ; RDI holds a 32-bit aligned address
    mov   ecx, 1000000000
.loop:                      ; do {
    mov   byte [rdi], al
    mov   byte [rdi+2], dl  ; store two bytes in the same dword
    ; no pointer increment, this is the same 32-bit dword every time
    dec   ecx
    jnz   .loop             ; } while(--ecx != 0)
    mov   eax, 60
    xor   edi, edi
    syscall                 ; x86-64 Linux sys_exit(0)
Or a loop over an 8kiB array like this, storing 1 byte or 1 word out of every 8 bytes (the 8kiB figure is for a C implementation with sizeof(unsigned int)=4 and CHAR_BIT=8, but this should compile to comparable functions on any C implementation, with only a minor bias if sizeof(unsigned int) isn't a power of 2). ASM on Godbolt for a few different ISAs, with either no unrolling, or the same amount of unrolling for both versions.
#include <stddef.h>   // the sizeof expressions make the index arithmetic size_t

// volatile defeats auto-vectorization
void byte_stores(volatile unsigned char *arr) {
    for (int outer=0 ; outer<1000 ; outer++)
        for (int i=0 ; i<1024 ; i++)       // loop over 1k * 2*sizeof(unsigned) chars
            arr[i*2*sizeof(unsigned) + 1] = 123;    // touch one byte of every 2 words
}

// volatile to defeat auto-vectorization: x86 could use AVX2 vpmaskmovd
void word_stores(volatile unsigned int *arr) {
    for (int outer=0 ; outer<1000 ; outer++)
        for (int i=0 ; i<1024 ; i++)       // same store count, so the same number of chars covered
            arr[i*2 + 0] = 123;       // touch every other int
}
Adjusting sizes as necessary, I'd be really curious if anyone can point to a system where word_stores() is faster than byte_stores(). (If actually benchmarking, beware of warm-up effects like dynamic clock speed, and the first pass triggering TLB misses and cache misses.)
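A minimal sketch of how such a comparison could be timed, with untimed warm-up passes to absorb clock ramp-up and the first-touch TLB and cache misses. Everything here is an illustrative assumption rather than part of the question: the names `seconds_per_call`, `byte_loop`, `word_loop` and `WORDS` are hypothetical, the loops are re-implemented to share one `void *` signature, and `clock()` measures CPU time at coarse resolution, so a real measurement would want a finer timer or hardware performance counters.

```c
#include <stddef.h>
#include <time.h>

enum { WORDS = 2048 };   /* 8 KiB buffer when sizeof(unsigned) == 4 */

typedef void (*store_loop_fn)(void *buf);

/* One byte stored out of every two words; volatile blocks auto-vectorization. */
void byte_loop(void *buf)
{
    volatile unsigned char *p = buf;
    for (size_t i = 0; i < WORDS / 2; i++)
        p[i * 2 * sizeof(unsigned) + 1] = 123;
}

/* One whole word stored out of every two words: same number of stores. */
void word_loop(void *buf)
{
    volatile unsigned *p = buf;
    for (size_t i = 0; i < WORDS / 2; i++)
        p[i * 2] = 123;
}

/* Run `warmup` untimed calls, then return the average CPU seconds
   per call over `reps` timed calls. */
double seconds_per_call(store_loop_fn fn, void *buf, int warmup, int reps)
{
    for (int i = 0; i < warmup; i++)
        fn(buf);

    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn(buf);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Calling `seconds_per_call(byte_loop, buf, 100, 100000)` and the same for `word_loop` on a `static unsigned buf[WORDS]` gives two numbers to compare. Equal store counts mean any consistent gap would point at a per-store penalty rather than a bandwidth difference.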
Or if actual C compilers for ancient platforms don't exist or generate sub-optimal code that doesn't bottleneck on store throughput, then any hand-crafted asm that would show an effect.
Any other way of demonstrating a slowdown for byte stores is fine, I don't insist on strided loops over arrays or spamming writes within one word.
I'd also be fine with detailed documentation about CPU internals, or CPU cycle timing numbers for different instructions. I'm leery of optimization advice or guides that could be based on this claim without having tested, though.
- Any still-relevant CPU or microcontroller where cached byte stores have an extra penalty?
- Any still-relevant CPU or microcontroller where un-cacheable byte stores have an extra penalty?
- Any not-still-relevant historical CPUs (with or without write-back or write-through caches) where either of the above are true? What's the most recent example?
e.g. is this the case on an ARM Cortex-A? Or Cortex-M? Any older ARM microarchitecture? Any MIPS microcontroller or early MIPS server/workstation CPU? Any other random RISC like PA-RISC, or CISC like VAX or 486? (CDC6600 was word-addressable.)
Or construct a test case involving loads as well as stores, e.g. showing word-RMW from byte stores competing with load throughput.
(I'm not interested in showing that store-forwarding from byte stores to word loads is slower than word->word, because it's normal that SF only works efficiently when a load is fully contained in the most recent store to touch any of the relevant bytes. But something that showed byte->byte forwarding being less efficient than word->word SF would be interesting, maybe with bytes that don't start at a word boundary.)
(I didn't mention byte loads because that's generally easy: access a full word from cache or RAM and then extract the byte you want. That implementation detail is indistinguishable other than for MMIO, where CPUs definitely don't read the containing word.)
On a load/store architecture like MIPS, working with byte data just means you use lb or lbu to load and zero- or sign-extend it, then store it back with sb. (If you need truncation to 8 bits between steps in registers, then you might need an extra instruction, so local vars should usually be register sized. Unless you want the compiler to auto-vectorize with SIMD with 8-bit elements, then often uint8_t locals are good...) But anyway, if you do it right and your compiler is good, it shouldn't cost any extra instructions to have byte arrays.
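As a small illustration of that point, here is a sketch of a byte-array update in C; the function name `add_to_bytes` is hypothetical. On a MIPS-like target I would expect each element to cost one lbu, one add and one sb, with the narrowing to 8 bits done by the store itself, but that is an assumption about typical compiler output, not a checked listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Add `delta` to every element of a byte array, wrapping modulo 256.
   The sum is computed at int width after integer promotion; the cast
   (and the byte store) truncates it back to 8 bits. */
void add_to_bytes(uint8_t *arr, size_t n, uint8_t delta)
{
    for (size_t i = 0; i < n; i++)
        arr[i] = (uint8_t)(arr[i] + delta);
}
```

On targets with byte-element SIMD a compiler may vectorize this loop instead, which is the case where narrow element types pay off.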
I notice that gcc has sizeof(uint_fast8_t) == 1 on ARM, AArch64, x86, and MIPS. But IDK how much stock we can put in that. The x86-64 System V ABI defines uint_fast32_t as a 64-bit type on x86-64. If they're going to do that (instead of 32-bit, which is x86-64's default operand-size), uint_fast8_t should also be a 64-bit type. Maybe the reason is to avoid zero-extension when it's used as an array index, if it was passed as a function arg in a register? (It could be zero-extended for free if you had to load it from memory anyway.)
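The sizes are easy to check on whatever toolchain is at hand. This sketch only prints what the implementation's `<stdint.h>` chose; the function name `print_fast_type_sizes` is hypothetical, and the output depends on the compiler and ABI, so no expected values are shown.

```c
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Print the sizes this C implementation picked for the "fast" integer types. */
void print_fast_type_sizes(void)
{
    printf("CHAR_BIT               = %d\n", CHAR_BIT);
    printf("sizeof(uint_fast8_t)   = %zu\n", sizeof(uint_fast8_t));
    printf("sizeof(uint_fast16_t)  = %zu\n", sizeof(uint_fast16_t));
    printf("sizeof(uint_fast32_t)  = %zu\n", sizeof(uint_fast32_t));
    printf("sizeof(uint_fast64_t)  = %zu\n", sizeof(uint_fast64_t));
}
```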
Recommended answer
My guess was wrong. Modern x86 microarchitectures really are different in this way from some (most?) other ISAs.
There can be a penalty for cached narrow stores even on high-performance non-x86 CPUs. The reduction in cache footprint can still make int8_t arrays worth using, though. (And on some ISAs like MIPS, not needing to scale an index for an addressing mode helps.)
Merging / coalescing in the store buffer between byte-store instructions to the same word before actual commit to L1d can also reduce or remove the penalty. (x86 sometimes can't do as much of this because its strong memory model requires all stores to commit in program order.)
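A toy model of why coalescing helps, assuming a 4-byte granule below which a commit needs a read-modify-write. It is not a description of any real store buffer: it only merges consecutive stores that fall in the same aligned word, and the name `count_partial_commits` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Given a sequence of byte-store addresses, count the commits that write
   only part of an aligned 4-byte word (the ones that would need an RMW
   under a 4-byte granule). If `coalesce` is nonzero, consecutive stores
   to the same word are merged into one pending entry before commit. */
size_t count_partial_commits(const uint32_t *addrs, size_t n, int coalesce)
{
    size_t partial = 0;
    size_t i = 0;

    while (i < n) {
        uint32_t word = addrs[i] & ~UINT32_C(3);   /* aligned word address */
        unsigned mask = 1u << (addrs[i] & 3);      /* bytes written so far */
        i++;

        while (coalesce && i < n && (addrs[i] & ~UINT32_C(3)) == word) {
            mask |= 1u << (addrs[i] & 3);
            i++;
        }

        if (mask != 0xFu)      /* not all four bytes covered */
            partial++;
    }
    return partial;
}
```

For byte stores to addresses 0, 1, 2, 3, 4, 8 the model counts six partial commits without merging and two with it, because the first four stores combine into one full-word commit.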
ARM's documentation for Cortex-A15 MPCore (from ~2012) says it uses 32-bit ECC granularity in L1d, and does in fact do a word-RMW for narrow stores to update the data.
The L1 data cache supports optional single-bit correct and double-bit detect error correction logic in both the tag and data arrays. The ECC granularity for the tag array is the tag for a single cache line, and the ECC granularity for the data array is a 32-bit word.
Because of the ECC granularity in the data array, a write to the array cannot update a portion of a 4-byte aligned memory location because there is not enough information to calculate the new ECC value. This is the case for any store instruction that does not write one or more aligned 4-byte regions of memory. In this case, the L1 data memory system reads the existing data in the cache, merges in the modified bytes, and calculates the ECC from the merged value. The L1 memory system attempts to merge multiple stores together to meet the aligned 4-byte ECC granularity and to avoid the read-modify-write requirement.
(When they say "the L1 memory system", I think they mean the store buffer, if you have contiguous byte stores that haven't yet committed to L1d.)
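The mechanism in the quoted passage can be sketched in software. This is a model under loud assumptions, not ARM's implementation: a single parity bit stands in for the real SECDED check bits, byte lanes are numbered little-endian, and the names `ecc_word`, `check_bits`, `store_word` and `store_byte` are all hypothetical. What it shows is that a full-word store can compute its check bits from the new data alone, while a byte store has to read the old word first.

```c
#include <stdint.h>

/* One 32-bit ECC granule: the data plus its check bits. */
typedef struct {
    uint32_t data;
    uint8_t  check;   /* stand-in for ECC check bits (toy: one parity bit) */
} ecc_word;

/* Toy check code: parity over all 32 data bits. A real L1d would use SECDED. */
uint8_t check_bits(uint32_t w)
{
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return (uint8_t)(w & 1u);
}

/* Aligned full-word store: check bits come from the new value only. */
void store_word(ecc_word *line, unsigned word_idx, uint32_t value)
{
    line[word_idx].data  = value;
    line[word_idx].check = check_bits(value);
}

/* Byte store: read the old word, merge the byte, recompute the check bits. */
void store_byte(ecc_word *line, unsigned byte_off, uint8_t value)
{
    unsigned idx   = byte_off / 4;
    unsigned shift = (byte_off % 4) * 8;           /* little-endian lanes */

    uint32_t old    = line[idx].data;                          /* read   */
    uint32_t merged = (old & ~(UINT32_C(0xFF) << shift))       /* modify */
                    | ((uint32_t)value << shift);
    line[idx].data  = merged;                                  /* write  */
    line[idx].check = check_bits(merged);
}
```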
Note that the RMW is atomic, and only involves the exclusively-owned cache line being modified. This is an implementation detail that doesn't affect the memory model. So my conclusion on Can modern x86 hardware not store a single byte to memory? is still (probably) correct that x86 can, and so can every other ISA that provides byte store instructions.
Cortex-A15 MPCore is a 3-way out-of-order execution CPU, so it's not a minimal power / simple ARM design, yet they chose to spend transistors on OoO exec but not efficient byte stores.
Presumably without the need to support efficient unaligned stores (which x86 software is more likely to assume / take advantage of), having slower byte stores was deemed worth it for the higher reliability of ECC for L1d without excessive overhead.
Cortex-A15 is probably not the only, and not the most recent, ARM core to work this way.
Other examples (found by @HadiBrais in comments):
Alpha 21264 (see Table 8-1 of Chapter 8 of this doc) has 8-byte ECC granularity for its L1d cache. Narrower stores (including 32-bit) result in a RMW when they commit to L1d, if they aren't merged in the store buffer first. The doc explains full details of what L1d can do per clock. And specifically documents that the store buffer does coalesce stores.
PowerPC RS64-II and RS64-III (see the section on errors in this doc). According to this abstract, L1 of the RS/6000 processor has 7 bits of ECC for each 32-bits of data.
Alpha was aggressively 64-bit from the ground up, so 8-byte granularity makes some sense, especially if the RMW cost can mostly be hidden / absorbed by the store buffer. (e.g. maybe the normal bottlenecks were elsewhere for most code on that CPU; its multi-ported cache could normally handle 2 operations per clock.)
POWER / PowerPC64 grew out of 32-bit PowerPC and probably cares about running 32-bit code with 32-bit integers and pointers. (So more likely to do non-contiguous 32-bit stores to data structures that couldn't be coalesced.) So 32-bit ECC granularity makes a lot of sense there.