Question
section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times 81920 db "A"
According to perf, it runs at 1.82 instructions per cycle. I cannot understand why it's so fast. After all, it has to be stored in memory (RAM), so it should be slow.
P.S. Is there any loop-carried dependency?
section .text
%define n 100000
_start:
xor rcx, rcx
jmp .cond
.begin:
movnti [array+rcx], eax
.cond:
add rcx, 1
cmp rcx, n
jl .begin
section .data
array times n dq 0
Now the loop takes 5 cycles per iteration. Why? After all, there is still no loop-carried dependency.
Answer
movnti can apparently sustain a throughput of one per clock when writing to the same address repeatedly.
I think movnti keeps writing into the same fill buffer, and it's not getting flushed very often because there are no other loads or stores happening. (That link is about copying from WC video memory with SSE4.1 NT loads, as well as storing to normal memory with NT stores.)
So the NT write-combining fill-buffer acts like a cache for multiple overlapping NT stores to the same address, and writes are actually hitting in the fill buffer instead of going to DRAM each time.
DDR DRAM only supports burst-transfer commands. If every movnti produced a 4B write that actually was visible to the memory chips, there'd be no way it could run that fast. The memory controller either has to read/modify/write, or do an interrupted burst transfer, since there is no non-burst write command. See also Ulrich Drepper's What Every Programmer Should Know About Memory.
We can further prove this is the case by running the test on multiple cores at once. Since they don't slow each other down at all, we can be sure that the writes are only infrequently making it out of the CPU cores and competing for memory cycles.
The reason your experiment doesn't show your loop running at 4 instructions per clock (one cycle per iteration) is that you used such a tiny repeat count. 100k cycles barely covers the startup overhead (which perf's timing includes).
For example, on a Core2 E6600 (Merom/Conroe) with dual channel DDR2 533MHz, the total time including all process startup / exit stuff is 0.113846 ms. That's only 266,007 cycles.
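A quick check with the numbers quoted above shows how little of that measured time the loop itself could account for, even in the best case of one iteration per cycle:

```python
# Sanity check on the figures above: how much of the measured run
# could the loop itself account for?
total_cycles = 266_007    # perf's cycle count for the whole run
n = 100_000               # loop iterations in the question's code
best_case = n * 1         # ~1 cycle per iteration if movnti sustains 1/clock

print(f"loop share of total: {best_case / total_cycles:.0%}")
```

Under 40% of the cycles belong to the loop; everything else is process startup/exit noise, so the measured IPC says little about the loop's steady-state throughput.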
A more reasonable microbenchmark shows one iteration (one movnti) per cycle:
global _start
_start:
xor ecx,ecx
.begin:
movnti [array], eax
dec ecx
jnz .begin ; 2^32 iterations
mov eax, 60 ; __NR_exit
xor edi,edi
syscall ; exit(0)
section .bss
array resb 81920
(asm-link is a script I wrote)
$ asm-link movnti-same-address.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 movnti-same-address.asm
+ ld -o movnti-same-address movnti-same-address.o
$ perf stat -e task-clock,cycles,instructions ./movnti-same-address
Performance counter stats for './movnti-same-address':
1835.056710 task-clock (msec) # 0.995 CPUs utilized
4,398,731,563 cycles # 2.397 GHz
12,891,491,495 instructions # 2.93 insns per cycle
1.843642514 seconds time elapsed
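The perf output above can be cross-checked by hand: the loop body is 3 instructions (movnti, dec, jnz) and runs 2^32 times, so the totals should work out to roughly one iteration per cycle:

```python
# Reproducing the perf arithmetic from the run above. The loop body is
# 3 instructions (movnti / dec / jnz) executed 2**32 times; the small
# surplus over 3 * 2**32 is the exit sequence and startup code.
instructions = 12_891_491_495   # from the perf stat output
cycles = 4_398_731_563
loop_insns = 3 * 2**32          # 12,884,901,888

print(f"IPC: {instructions / cycles:.2f}")
print(f"cycles per iteration: {cycles / 2**32:.2f}")
```

That comes out to 2.93 IPC and about 1.02 cycles per iteration, matching the one-movnti-per-clock claim.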
Running in parallel:
$ time ./movnti-same-address; time ./movnti-same-address & time ./movnti-same-address &
real 0m1.844s / user 0m1.828s # running alone
[1] 12523
[2] 12524
peter@tesla:~/src/SO$
real 0m1.855s / user 0m1.824s # running together
real 0m1.984s / user 0m1.808s
# output compacted by hand to save space
I expect perfect SMP scaling (except with hyperthreading), up to any number of cores. e.g. on a 10-core Xeon, 10 copies of this test could run at the same time (on separate physical cores), and each one would finish in the same time as if it was running alone. (Single-core turbo vs. multi-core turbo will also be a factor, though, if you measure wall-clock time instead of cycle counts.)
zx485's uop count nicely explains why the loop isn't bottlenecked by the frontend or unfused-domain execution resources.
However, this disproves his theory about the ratio of CPU to memory clocks having anything to do with it. Interesting coincidence, though, that the OP chose a count that happened to make the final total IPC work out that way.
P.S. Is there any loop-carried dependency?
Yes, the loop counter (1 cycle). BTW, you could have saved an insn by counting down towards zero with dec/jg instead of counting up and having to use a cmp.
The write-after-write memory dependency isn't a "true" dependency in the normal sense, but it is something the CPU has to keep track of. The CPU doesn't "notice" that the same value is written repeatedly, so it has to make sure the last write is the one that "counts".
This is called an architectural hazard. I think the term still applies when talking about memory, rather than registers.