I am wondering, mostly out of curiosity, if using the same register for an operation is better than using two. What would be better, considering performance and/or other concerns?

    mov %rbx, %rcx
    imul %rcx, %rcx

or

    mov %rbx, %rcx
    imul %rbx, %rcx

Any tips for how to benchmark this, or resources where I could read about this type of thing would be appreciated, as I am new to assembly.

Solution

See Agner Fog's microarch pdf, and his optimizing assembly guide. Also other links in the x86 tag wiki (e.g.
Intel's optimization manual).

The interesting option you didn't mention is:

    mov %rbx, %rcx
    imul %rbx, %rbx    # doesn't have to wait for mov to execute
                       # old value of %rbx is still available in %rcx

If the imul is on the critical path, and mov has non-zero latency (like on AMD CPUs, and Intel before IvyBridge), this is potentially better. The result of imul will be ready one cycle earlier, because it has no dependency on the result of the mov.

If, however, the old value is on the critical path and the squared value isn't, then this is worse because it adds a mov to the critical path.

Of course, it also means you have to keep track of the fact that your old variable now lives in a different register, and the old register has the squared value. If this is a problem in a loop, unroll it so that values end up where the top of the loop expects them. If you wanted this to be easy, you'd use a compiler instead of optimizing asm by hand.

However, Intel P6-family CPUs (PPro/PII to Nehalem) have limited register-read ports, so it can be better to favour reading registers that you just wrote. If %rbx wasn't written in the last couple of cycles, it will have to be read from the permanent register file when the mov and imul uops go through the rename & issue stage (the RAT).

If they don't issue as part of the same group of 4, then they would each need to read %rbx separately. Since the register file in Core2/Nehalem only has 3 read ports, issue groups (quartets, as Agner Fog calls them) stall until all their not-recently-written input register values are read from the register file (at 3 per cycle, or 2 on Core2 if none of the 3 regs are index regs in an addressing mode).

For the full details, see Agner Fog's microarch pdf section 8.8. The Core2 section refers back to the PPro section. PPro has a 3-wide pipeline, so in that section Agner talks about triplets, not quartets.

If mov and imul issue together, they both share the same read of %rbx.
There's a 3 in 4 chance of this happening on Core2/Nehalem.

Choosing just between the sequences you mention, the first one has a clear (but usually small) advantage over the second for Intel P6-family CPUs. There's no difference for other CPUs, AFAIK, so the choice is obvious.

    mov %rbx, %rcx
    imul %rcx, %rcx    # uses only the recently-written rcx; can't contribute to register-read stalls

worst of both worlds:

    mov %rbx, %rcx
    imul %rbx, %rcx    # can't execute until after the mov, but still reads a potentially-old register

If you're going to depend on a recently-written register, you might as well use only recently-written registers.

Intel Sandybridge-family uses a physical register file (like AMD Bulldozer-family), and doesn't have register-read stalls.

Ivybridge (2nd gen Sandybridge) and later also handle mov reg,reg at register rename time, with zero latency and no execution unit. This means it doesn't matter whether you imul rbx or rcx as far as critical-path length is concerned.

However, AMD Bulldozer-family can only handle xmm register moves in its rename stage; integer register moves still have 1c latency.

It's potentially still worth caring about which dependency chain the mov is part of, if latency is a limiting factor in the cycles per iteration of a loop.

I think you could put together a microbenchmark that has a register-read stall on Core2 with imul %rbx, %rcx, but not with imul %rcx, %rcx. However, that would require some trial and error to get the mov and imul to issue in different groups, and unless you're feeling really creative, probably some artificial-looking surrounding code that exists only to read lots of registers. For example: lea (%rsi, %rdi, 1), %eax, or even add (%rsi, %rdi, 1), %eax. The add has to read all three registers, and it does micro-fuse on Core2/Nehalem, so it only takes 1 uop slot in an issue group. (It doesn't micro-fuse on SnB-family.)
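The 3-in-4 figure for mov and imul issuing together can be reproduced by simple counting. The sketch below is only a model, not a measurement: it assumes mov and imul are one uop each, that they are adjacent in the uop stream, and that the mov is equally likely to land in any slot of an issue group. The function name same_group_probability is made up for this illustration.

```python
from fractions import Fraction


def same_group_probability(group_width: int) -> Fraction:
    """Chance that two back-to-back single-uop instructions issue in the same group.

    Assumption (not from any manual): the first uop is equally likely to
    occupy any of the group's slots. The pair is split across two groups
    only when the first uop sits in the last slot.
    """
    slots_that_keep_pair_together = sum(
        1 for slot in range(group_width) if slot + 1 < group_width
    )
    return Fraction(slots_that_keep_pair_together, group_width)


if __name__ == "__main__":
    print(same_group_probability(4))  # quartets (Core2/Nehalem): 3/4
    print(same_group_probability(3))  # triplets (PPro-style 3-wide issue): 2/3
```

In real code the alignment is not random: it is fixed by the instructions that precede the pair, which is why getting the two uops into different groups in a microbenchmark takes trial and error.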