Problem description
I am writing a simple BigInteger type in Delphi. It mainly consists of a dynamic array of TLimb, where a TLimb is a 32 bit unsigned integer, and a 32 bit size field, which also holds the sign bit for the BigInteger.
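In outline, the type looks something like this (a minimal sketch only; the field names FData and FSize are illustrative, and the limb order is assumed least-significant first):
type
  TLimb = Cardinal;                // 32-bit unsigned limb
  PLimb = ^TLimb;

  BigInteger = record
  private
    FData: array of TLimb;         // magnitude, assumed least significant limb first
    FSize: Integer;                // limb count; the top bit doubles as the sign bit
  end;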
To add two BigIntegers, I create a new BigInteger of the appropriate size and then, after some bookkeeping, call the following procedure, passing it three pointers to the respective starts of the arrays for the left and right operand and the result, as well as the numbers of limbs for left and right, respectively.
Plain code:
class procedure BigInteger.PlainAdd(Left, Right, Result: PLimb; LSize, RSize: Integer);
asm
// EAX = Left, EDX = Right, ECX = Result
PUSH ESI
PUSH EDI
PUSH EBX
MOV ESI,EAX // Left
MOV EDI,EDX // Right
MOV EBX,ECX // Result
MOV ECX,RSize // Number of limbs at Right
MOV EDX,LSize // Number of limbs at Left
CMP EDX,ECX
JAE @SkipSwap
XCHG ECX,EDX // Left and LSize should be largest
XCHG ESI,EDI // so swap
@SkipSwap:
SUB EDX,ECX // EDX contains rest
PUSH EDX // ECX contains smaller size
XOR EDX,EDX
@MainLoop:
MOV EAX,[ESI + CLimbSize*EDX] // CLimbSize = SizeOf(TLimb) = 4.
ADC EAX,[EDI + CLimbSize*EDX]
MOV [EBX + CLimbSize*EDX],EAX
INC EDX
DEC ECX
JNE @MainLoop
POP EDI
INC EDI // Do not change Carry Flag
DEC EDI
JE @LastLimb
@RestLoop:
MOV EAX,[ESI + CLimbSize*EDX]
ADC EAX,ECX
MOV [EBX + CLimbSize*EDX],EAX
INC EDX
DEC EDI
JNE @RestLoop
@LastLimb:
ADC ECX,ECX // Add in final carry
MOV [EBX + CLimbSize*EDX],ECX
@Exit:
POP EBX
POP EDI
POP ESI
end;
// RET is inserted by Delphi compiler.
This code worked well, and I was pretty satisfied with it, until I noticed that, on my development setup (Win7 in a Parallels VM on an iMac), a simple PURE PASCAL addition routine, doing the same while emulating the carry with a variable and a few if clauses, was faster than my plain, straightforward handcrafted assembler routine.
It took me a while to find out that on certain CPUs (including my iMac and an older laptop), the combination of DEC or INC and ADC or SBB could be extremely slow. But on most of my others (I have five other PCs to test it on, although four of these are exactly the same), it was quite fast.
So I wrote a new version, emulating INC and DEC using LEA and JECXZ instead, like so:
Emulating code (partial):
@MainLoop:
MOV EAX,[ESI + EDX*CLimbSize]
LEA ECX,[ECX - 1] // Avoid INC and DEC, see above.
ADC EAX,[EDI + EDX*CLimbSize]
MOV [EBX + EDX*CLimbSize],EAX
LEA EDX,[EDX + 1]
JECXZ @DoRestLoop // LEA does not modify Zero flag, so JECXZ is used.
JMP @MainLoop
@DoRestLoop:
// similar code for the rest loop
That made my code on the "slow" machines almost three times as fast, but some 20% slower on the "faster" machines. So now, as initialization code, I do a simple timing loop and use that to decide if I'll set up the unit to call the plain or the emulated routine(s). This is almost always correct, but sometimes it chooses the (slower) plain routines when it should have chosen the emulating routines.
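In outline, the dispatch looks something like this (a sketch with illustrative names; it assumes standalone PlainAdd and EmulatedAdd procedures with the signature above, and a Delphi version with TStopwatch in System.Diagnostics):
type
  TAddProc = procedure(Left, Right, Result: PLimb; LSize, RSize: Integer);

var
  AddProc: TAddProc;                        // the routine the unit actually calls

// Assumed helper: times one candidate routine on a small scratch buffer.
function MeasureTicks(Candidate: TAddProc): Int64;
var
  A, B: array[0..15] of TLimb;              // contents don't matter for timing
  R: array[0..16] of TLimb;                 // one extra limb for the final carry
  SW: TStopwatch;
  I: Integer;
begin
  SW := TStopwatch.StartNew;
  for I := 1 to 100000 do
    Candidate(PLimb(@A[0]), PLimb(@B[0]), PLimb(@R[0]), Length(A), Length(B));
  Result := SW.ElapsedTicks;
end;

procedure SelectAddRoutine;
begin
  if MeasureTicks(EmulatedAdd) < MeasureTicks(PlainAdd) then
    AddProc := EmulatedAdd                  // the LEA/JECXZ version wins on this CPU
  else
    AddProc := PlainAdd;                    // the INC/DEC version wins
end;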
But I don't know if this is the best way to do this.
I gave my solution, but do the asm gurus here perhaps know a better way to avoid the slowness on certain CPUs?
Peter and Nils' answers helped me a lot to get on the right track. This is the main part of my final solution for the DEC version:
Plain code:
class procedure BigInteger.PlainAdd(Left, Right, Result: PLimb; LSize, RSize: Integer);
asm
PUSH ESI
PUSH EDI
PUSH EBX
MOV ESI,EAX // Left
MOV EDI,EDX // Right
MOV EBX,ECX // Result
MOV ECX,RSize
MOV EDX,LSize
CMP EDX,ECX
JAE @SkipSwap
XCHG ECX,EDX
XCHG ESI,EDI
@SkipSwap:
SUB EDX,ECX
PUSH EDX
XOR EDX,EDX
XOR EAX,EAX
MOV EDX,ECX
AND EDX,$00000003
SHR ECX,2
CLC
JE @MainTail
@MainLoop:
// Unrolled 4 times. More times will not improve speed anymore.
MOV EAX,[ESI]
ADC EAX,[EDI]
MOV [EBX],EAX
MOV EAX,[ESI + CLimbSize]
ADC EAX,[EDI + CLimbSize]
MOV [EBX + CLimbSize],EAX
MOV EAX,[ESI + 2*CLimbSize]
ADC EAX,[EDI + 2*CLimbSize]
MOV [EBX + 2*CLimbSize],EAX
MOV EAX,[ESI + 3*CLimbSize]
ADC EAX,[EDI + 3*CLimbSize]
MOV [EBX + 3*CLimbSize],EAX
// Update pointers.
LEA ESI,[ESI + 4*CLimbSize]
LEA EDI,[EDI + 4*CLimbSize]
LEA EBX,[EBX + 4*CLimbSize]
// Update counter and loop if required.
DEC ECX
JNE @MainLoop
@MainTail:
// Add index*CLimbSize so @MainX branches can fall through.
LEA ESI,[ESI + EDX*CLimbSize]
LEA EDI,[EDI + EDX*CLimbSize]
LEA EBX,[EBX + EDX*CLimbSize]
// Indexed jump.
LEA ECX,[@JumpsMain]
JMP [ECX + EDX*TYPE Pointer]
// Align jump table manually, with NOPs. Update if necessary.
NOP
// Jump table.
@JumpsMain:
DD @DoRestLoop
DD @Main1
DD @Main2
DD @Main3
@Main3:
MOV EAX,[ESI - 3*CLimbSize]
ADC EAX,[EDI - 3*CLimbSize]
MOV [EBX - 3*CLimbSize],EAX
@Main2:
MOV EAX,[ESI - 2*CLimbSize]
ADC EAX,[EDI - 2*CLimbSize]
MOV [EBX - 2*CLimbSize],EAX
@Main1:
MOV EAX,[ESI - CLimbSize]
ADC EAX,[EDI - CLimbSize]
MOV [EBX - CLimbSize],EAX
@DoRestLoop:
// etc...
I removed a lot of white space, and I guess the reader can get the rest of the routine. It is similar to the main loop. A speed improvement of approx. 20% for larger BigIntegers, and some 10% for small ones (only a few limbs).
The 64 bit version now uses 64 bit addition where possible (in the main loop and in Main3 and Main2, which are not "fall-through" like above). Before, 64 bit was quite a lot slower than 32 bit, but now it is 30% faster than 32 bit and twice as fast as the original simple 64 bit loop.
Intel suggests the following in its Intel 64 and IA-32 Architectures Optimization Reference Manual, 3.5.2.6 Partial Flag Register Stalls -- Example 3-29:
XOR EAX,EAX
.ALIGN 16
@MainLoop:
ADD EAX,[ESI] // Sets all flags, so no partial flag register stall
ADC EAX,[EDI] // ADD added in previous carry, so its result might have carry
MOV [EBX],EAX
MOV EAX,[ESI + CLimbSize]
ADC EAX,[EDI + CLimbSize]
MOV [EBX + CLimbSize],EAX
MOV EAX,[ESI + 2*CLimbSize]
ADC EAX,[EDI + 2*CLimbSize]
MOV [EBX + 2*CLimbSize],EAX
MOV EAX,[ESI + 3*CLimbSize]
ADC EAX,[EDI + 3*CLimbSize]
MOV [EBX + 3*CLimbSize],EAX
SETC AL // Save carry for next iteration
MOVZX EAX,AL
ADD ESI,CUnrollIncrement*CLimbSize // LEA has slightly worse latency
ADD EDI,CUnrollIncrement*CLimbSize
ADD EBX,CUnrollIncrement*CLimbSize
DEC ECX
JNZ @MainLoop
The flag is saved in AL, and through MOVZX in EAX. It is added in through the first ADD in the loop. Then an ADC is needed, because the ADD might generate a carry. Also see comments.
Because the carry is saved in EAX, I can also use ADD to update the pointers. The first ADD in the loop also updates all flags, so ADC won't suffer from a partial flag register stall.
Recommended answer
What you're seeing on old P6-family CPUs is a partial-flag stall.
Early Sandybridge-family handles merging more efficiently, and later SnB-family (e.g. Skylake) has no merging cost at all: uops that need both CF and some flags from the SPAZO group read them as 2 separate inputs.
Intel CPUs (other than P4) rename each flag bit separately, so JNE only depends on the last instruction that set all the flags it uses (in this case, just the Z flag). In fact, recent Intel CPUs can even internally combine an inc/jne into a single inc-and-branch uop (macro-fusion). However, the trouble comes when reading a flag bit that was left unmodified by the last instruction that updated any flags.
Agner Fog says Intel CPUs (even PPro / PII) don't stall on inc / jnz. It's not actually the inc/jnz that's stalling, it's the adc in the next iteration that has to read the CF flag after inc wrote other flags but left CF unmodified.
; Example 5.21. Partial flags stall when reading unmodified flag bits
cmp eax, ebx
inc ecx
jc xx
; Partial flags stall (P6 / PIII / PM / Core2 / Nehalem)
Agner Fog also says more generally: "Avoid code that relies on the fact that INC or DEC leaves the carry flag unchanged." (for Pentium M/Core2/Nehalem). The suggestion to avoid inc/dec entirely is obsolete, and only applied to P4. Other CPUs rename different parts of EFLAGS separately, and only have trouble when merging is required (reading a flag that was unmodified by the last insn to write any flags).
On the machines where it's fast (Sandybridge and later), they're inserting an extra uop to merge the flags register when you read bits that weren't written by the last instruction that modified it. This is much faster than stalling for 7 cycles, but still not ideal.
P4 always tracks whole registers, instead of renaming partial registers, not even EFLAGS. So inc/jz has a "false" dependency on whatever wrote the flags before it. This means that the loop condition can't detect the end of the loop until execution of the adc dep chain gets there, so the branch mispredict that can happen when the loop-branch stops being taken can't be detected early. It does prevent any partial-flags stalls, though.
Your lea / jecxz avoids the problem nicely. It's slower on SnB and later because you didn't unroll your loop at all. Your LEA version is 11 uops (can issue one iteration per 3 cycles), while the inc version is 7 uops (can issue one iter per 2 cycles), not counting the flag-merging uop it inserts instead of stalling.
If the loop instruction wasn't slow, it would be perfect for this. It's actually fast on AMD Bulldozer-family (1 m-op, same cost as a fused compare-and-branch), and Via Nano3000. It's bad on all Intel CPUs, though (7 uops on SnB-family).
When you unroll, you can get another small gain from using pointers instead of indexed addressing modes, because 2-reg addressing modes can't micro-fuse on SnB and later. A group of load/adc/store instructions is 6 uops without micro-fusion, but only 4 with micro-fusion. CPUs can issue 4 fused-domain uops/clock. (See Agner Fog's CPU microarch doc, and instruction tables, for details on this level.)
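To make that 6-versus-4 count concrete, here is the same load/adc/store group written both ways (a sketch; uop counts are for SnB-family before Broadwell, as discussed above):
; Two-register (indexed) addressing modes don't micro-fuse on SnB and later:
; load = 1 uop, ADC = 2 ALU uops + 1 separate load uop, store = 2 uops  -> 6 fused-domain uops
        MOV     EAX,[ESI + EDX*CLimbSize]
        ADC     EAX,[EDI + EDX*CLimbSize]
        MOV     [EBX + EDX*CLimbSize],EAX
; One-register addressing modes keep the ADC's load and the store micro-fused:
; load = 1 uop, ADC + fused load = 2 uops, store = 1 uop  -> 4 fused-domain uops
        MOV     EAX,[ESI]
        ADC     EAX,[EDI]
        MOV     [EBX],EAX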
Save uops when you can to make sure the CPU can issue instructions faster than execute, to make sure it can see far enough ahead in the instruction stream to absorb any bubbles in insn fetch (e.g. branch mispredict). Fitting in the 28uop loop buffer also means power savings (and on Nehalem, avoiding instruction-decoding bottlenecks.) There are things like instruction alignment and crossing uop cache-line boundaries that make it hard to sustain a full 4 uops / clock without the loop buffer, too.
Another trick is to keep pointers to the end of your buffers, and count up towards zero. (So at the start of your loop, you get the first item as end[-idx].)
; pure loads are always one uop, so we can still index it
; with no perf hit on SnB
add esi, ecx ; point to end of src1
neg ecx
UNROLL equ 4
@MainLoop:
MOV EAX, [ESI + 0*CLimbSize + ECX*CLimbSize]
ADC EAX, [EDI + 0*CLimbSize]
MOV [EBX + 0*CLimbSize], EAX
MOV EAX, [ESI + 1*CLimbSize + ECX*CLimbSize]
ADC EAX, [EDI + 1*CLimbSize]
MOV [EBX + 1*CLimbSize], EAX
; ... repeated UNROLL times. Use an assembler macro to repeat these 3 instructions with increasing offsets
LEA ECX, [ECX+UNROLL] ; loop counter
LEA EDI, [EDI+CLimbSize*UNROLL] ; Unrolling makes it worth doing
LEA EBX, [EBX+CLimbSize*UNROLL] ; a separate increment to save a uop for every ADC and store on SnB & later.
JECXZ @DoRestLoop // LEA does not modify Zero flag, so JECXZ is used.
JMP @MainLoop
@DoRestLoop:
An unroll of 4 should be good. No need to overdo it, since you're probably going to be able to saturate the load/store ports of pre-Haswell CPUs with an unroll of just 3 or 4, maybe even 2.
An unroll of 2 will make the above loop exactly 14 fused-domain uops for Intel CPUs. adc is 2 ALU (+1 fused memory), jecxz is 2, the rest (including LEA) are all 1. In the unfused domain, 10 ALU/branch and 6 memory (well, 8 memory if you really count store-address and store-data separately).
- 14 fused-domain uops per iteration: issue one iteration per 4 clocks. (The odd 2 uops at the end have to issue as a group of 2, even from the loop buffer.)
- 10 ALU & branch uops: takes 3.33c to execute them all on pre-Haswell. I don't think any one port will be a bottleneck, either: adc's uops can run on any port, and lea can run on p0/p1. The jumps use port5 (and jecxz also uses one of p0/p1).
- 6 memory operations: takes 3c to execute on pre-Haswell CPUs, which can handle 2 per clock. Haswell added a dedicated AGU for stores, so it can sustain 2 loads + 1 store per clock.
So for pre-haswell CPUs, using LEA/JECXZ, an unroll of 2 won't quite saturate either the ALU or the load/store ports. An unroll of 4 will bring it up to 22 fused uops (6 cycles to issue). 14 ALU&branch: 4.66c to execute. 12 memory: 6 cycles to execute. So an unroll of 4 will saturate pre-Haswell CPUs, but only just barely. The CPU won't have any buffer of instructions to churn through on a branch mispredict.
Haswell and later will always be bottlenecked on the frontend (4 uops per clock limit), because the load/adc/store combo takes 4 uops, and can be sustained at one per clock. So there's never any "room" for loop overhead without cutting into adc throughput. This is where you have to know not to overdo it and unroll too much.
On Broadwell / Skylake, adc is only a single uop with 1c latency, and load / adc r, m / store appears to be the best sequence. adc m, r/i is 4 uops. This should sustain one adc per clock, like AMD.
On AMD CPUs, adc is only one macro-op, so if the CPU can sustain an issue rate of 4 (i.e. no decoding bottlenecks), then they can also use their 2 load / 1 store port to beat Haswell. Also, jecxz on AMD is as efficient as any other branch: only one macro-op. Multi-precision math is one of the few things AMD CPUs are good at. Lower latencies on some integer instructions give them an advantage in some GMP routines.
An unroll of more than 5 might hurt performance on Nehalem, because that would make the loop bigger than the 28uop loop buffer. Instruction decoding would then limit you to less than 4 uops per clock. On even earlier (Core2), there's a 64B x86-instruction loop buffer (64B of x86 code, not uops), which helps some with decode.
Unless this adc routine is the only bottleneck in your app, I'd keep the unroll factor down to maybe 2. Or maybe even not unroll, if that saves a lot of prologue / epilogue code, and your BigInts aren't too big. You don't want to bloat the code too much and create cache misses when callers call lots of different BigInteger functions, like add, sub, mul, and do other things in between. Unrolling too much to win at microbenchmarks can shoot yourself in the foot if your program doesn't spend a long time in your inner loop on each call.
If your BigInt values aren't usually gigantic, then it's not just the loop you have to tune. A smaller unroll could be good to simplify the prologue/epilogue logic. Make sure you check lengths so ECX doesn't cross zero without ever being zero, of course. This is the trouble with unrolling and vectors. :/
This might be the most efficient way:
lahf
# clobber flags
sahf ; cheap on AMD and Intel. This doesn't restore OF, but we only care about CF
# or
setc al
# clobber flags
add al, 255 ; generate a carry if al is non-zero
Using the same register as the adc dep chain isn't actually a problem: eax will always be ready at the same time as the CF output from the last adc. (On AMD and P4/Silvermont partial-reg writes have a false dep on the full reg. They don't rename partial regs separately). The save/restore is part of the adc dep chain, not the loop condition dep chain.
The loop condition only checks flags written by cmp, sub, or dec. Saving/restoring flags around it doesn't make it part of the adc dep chain, so the branch mispredict at the end of the loop can be detected before adc execution gets there. (A previous version of this answer got this wrong.)
There's almost certainly some room to shave off instructions in the setup code, maybe by using registers where values start. You don't have to use edi and esi for pointers, although I know it makes initial development easier when you're using registers in ways consistent with their "traditional" use. (e.g. destination pointer in EDI).
Does Delphi let you use ebp? It's nice to have a 7th register.
Obviously 64bit code would make your BigInt code run about twice as fast, even though you'd have to worry about doing a single 32b adc at the end of a loop of 64bit adc. It would also give you 2x the amount of registers.
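For example, the tail of a hypothetical x86-64 version could handle the odd 32-bit limb roughly like this (register names are illustrative; the branch that decides whether this tail runs must be taken without clobbering CF, e.g. with an indexed jump as in the 32-bit version above):
@Tail32:
        MOV     EAX,[RSI]          ; load the one remaining 32-bit limb
        ADC     EAX,[RDI]          ; the carry from the last 64-bit ADC flows straight in
        MOV     [RBX],EAX          ; the final carry-out still has to be stored after this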