我正在围绕获得一段高度一致的代码运行时间进行一些实验。我当前正在计时的代码是一个相当随意的,受CPU约束的工作负载:
int cpu_workload_external_O3(){
int x = 0;
for(int ind = 0; ind < 12349560; ind++){
x = ((x ^ 0x123) + x * 3) % 123456;
}
return x;
}
我编写了一个内核模块,该模块禁用中断,然后运行上述功能的10次尝试,通过获取前后时钟周期计数器的差值来计时每次尝试。其他注意事项:
换句话说,我相信大多数/所有系统可变性的原因都可以解决,尤其是当作为通过
spin_lock_irqsave()
禁用了中断的内核模块运行时,代码在运行时应该获得几乎相同的性能(也许性能很小)首次将某些指令拉入高速缓存时,在第一次运行时就命中了,仅此而已)。确实,当使用
-O3
编译基准测试代码时,我平均看到了大约135,845,192个周期中的最多200个周期,并且大多数试验花费的时间完全相同。但是,当使用-O0
进行编译时,范围从262,710,916上升至158,386个周期。范围是指最长和最短运行时间之间的差。而且,对于-O0
代码,哪个试验最慢/最快并没有太大的一致性-违反直觉,有时候,最快的是第一个,最慢的是后一个!那么:是什么导致
-O0
代码中的可变性如此高的上限?看一下程序集,似乎-O3
代码将所有内容(?)存储在寄存器中,而-O0
代码具有对sp
的大量引用,因此似乎正在访问内存。但是即使那样,我也希望所有内容都可以放入L1缓存中,并以相当确定的访问时间坐在那里。代码
被基准测试的代码在上面的代码段中。组件在下面。两者都使用
gcc 7.4.0
编译,除了-O0
和-O3
之外,没有任何标志。-O0
0000000000000000 <cpu_workload_external_O0>:
0: d10043ff sub sp, sp, #0x10
4: b9000bff str wzr, [sp, #8]
8: b9000fff str wzr, [sp, #12]
c: 14000018 b 6c <cpu_workload_external_O0+0x6c>
10: b9400be1 ldr w1, [sp, #8]
14: 52802460 mov w0, #0x123 // #291
18: 4a000022 eor w2, w1, w0
1c: b9400be1 ldr w1, [sp, #8]
20: 2a0103e0 mov w0, w1
24: 531f7800 lsl w0, w0, #1
28: 0b010000 add w0, w0, w1
2c: 0b000040 add w0, w2, w0
30: 528aea61 mov w1, #0x5753 // #22355
34: 72a10fc1 movk w1, #0x87e, lsl #16
38: 9b217c01 smull x1, w0, w1
3c: d360fc21 lsr x1, x1, #32
40: 130c7c22 asr w2, w1, #12
44: 131f7c01 asr w1, w0, #31
48: 4b010042 sub w2, w2, w1
4c: 529c4801 mov w1, #0xe240 // #57920
50: 72a00021 movk w1, #0x1, lsl #16
54: 1b017c41 mul w1, w2, w1
58: 4b010000 sub w0, w0, w1
5c: b9000be0 str w0, [sp, #8]
60: b9400fe0 ldr w0, [sp, #12]
64: 11000400 add w0, w0, #0x1
68: b9000fe0 str w0, [sp, #12]
6c: b9400fe1 ldr w1, [sp, #12]
70: 528e0ee0 mov w0, #0x7077 // #28791
74: 72a01780 movk w0, #0xbc, lsl #16
78: 6b00003f cmp w1, w0
7c: 54fffcad b.le 10 <cpu_workload_external_O0+0x10>
80: b9400be0 ldr w0, [sp, #8]
84: 910043ff add sp, sp, #0x10
88: d65f03c0 ret
-O3
0000000000000000 <cpu_workload_external_O3>:
0: 528e0f02 mov w2, #0x7078 // #28792
4: 5292baa4 mov w4, #0x95d5 // #38357
8: 529c4803 mov w3, #0xe240 // #57920
c: 72a01782 movk w2, #0xbc, lsl #16
10: 52800000 mov w0, #0x0 // #0
14: 52802465 mov w5, #0x123 // #291
18: 72a043e4 movk w4, #0x21f, lsl #16
1c: 72a00023 movk w3, #0x1, lsl #16
20: 4a050001 eor w1, w0, w5
24: 0b000400 add w0, w0, w0, lsl #1
28: 0b000021 add w1, w1, w0
2c: 71000442 subs w2, w2, #0x1
30: 53067c20 lsr w0, w1, #6
34: 9ba47c00 umull x0, w0, w4
38: d364fc00 lsr x0, x0, #36
3c: 1b038400 msub w0, w0, w3, w1
40: 54ffff01 b.ne 20 <cpu_workload_external_O3+0x20> // b.any
44: d65f03c0 ret
内核模块
运行试验的代码如下。它在每次迭代之前/之后读取
PMCCNTR_EL0
,将差异存储在数组中,并在所有试验的最后打印出最小/最大时间。函数cpu_workload_external_O0
和cpu_workload_external_O3
在单独编译然后链接到的外部目标文件中。#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include "cpu.h"
static DEFINE_SPINLOCK(lock);
void runBenchmark(int (*benchmarkFunc)(void)){
// Enable perf counters.
u32 pmcr;
asm volatile("mrs %0, pmcr_el0" : "=r" (pmcr));
asm volatile("msr pmcr_el0, %0" : : "r" (pmcr|(1)));
// Run trials, storing the time of each in `clockDiffs`.
u32 result = 0;
#define numtrials 10
u32 clockDiffs[numtrials] = {0};
u32 clockStart, clockEnd;
for(int trial = 0; trial < numtrials; trial++){
asm volatile("isb; mrs %0, PMCCNTR_EL0" : "=r" (clockStart));
result += benchmarkFunc();
asm volatile("isb; mrs %0, PMCCNTR_EL0" : "=r" (clockEnd));
// Reset PMCCNTR_EL0.
asm volatile("mrs %0, pmcr_el0" : "=r" (pmcr));
asm volatile("msr pmcr_el0, %0" : : "r" (pmcr|(((uint32_t)1) << 2)));
clockDiffs[trial] = clockEnd - clockStart;
}
// Compute the min and max times across all trials.
u32 minTime = clockDiffs[0];
u32 maxTime = clockDiffs[0];
for(int ind = 1; ind < numtrials; ind++){
u32 time = clockDiffs[ind];
if(time < minTime){
minTime = time;
} else if(time > maxTime){
maxTime = time;
}
}
// Print the result so the benchmark function doesn't get optimized out.
printk("result: %d\n", result);
printk("diff: max %d - min %d = %d cycles\n", maxTime, minTime, maxTime - minTime);
}
int init_module(void) {
printk("enter\n");
unsigned long flags;
spin_lock_irqsave(&lock, flags);
printk("-O0\n");
runBenchmark(cpu_workload_external_O0);
printk("-O3\n");
runBenchmark(cpu_workload_external_O3);
spin_unlock_irqrestore(&lock, flags);
return 0;
}
void cleanup_module(void) {
printk("exit\n");
}
硬件
$ lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 4
NUMA node(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
BogoMIPS: 166.66
L1d cache: 32K
L1i cache: 48K
L2 cache: 2048K
NUMA node0 CPU(s): 0-15
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2 ONLINE
0 0 0 0 0:0:0 yes
1 0 0 1 1:1:0 yes
2 0 0 2 2:2:0 yes
3 0 0 3 3:3:0 yes
4 0 1 4 4:4:1 yes
5 0 1 5 5:5:1 yes
6 0 1 6 6:6:1 yes
7 0 1 7 7:7:1 yes
8 0 2 8 8:8:2 yes
9 0 2 9 9:9:2 yes
10 0 2 10 10:10:2 yes
11 0 2 11 11:11:2 yes
12 0 3 12 12:12:3 yes
13 0 3 13 13:13:3 yes
14 0 3 14 14:14:3 yes
15 0 3 15 15:15:3 yes
$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 32159 MB
node 0 free: 30661 MB
node distances:
node 0
0: 10
sample 测量
以下是内核模块一次执行的一些输出:
[902574.112692] kernel-module: running on cpu 15
[902576.403537] kernel-module: trial 00: 309983568 74097394 98796602 <-- max
[902576.403539] kernel-module: trial 01: 309983562 74097397 98796597
[902576.403540] kernel-module: trial 02: 309983562 74097397 98796597
[902576.403541] kernel-module: trial 03: 309983562 74097397 98796597
[902576.403543] kernel-module: trial 04: 309983562 74097397 98796597
[902576.403544] kernel-module: trial 05: 309983562 74097397 98796597
[902576.403545] kernel-module: trial 06: 309983562 74097397 98796597
[902576.403547] kernel-module: trial 07: 309983562 74097397 98796597
[902576.403548] kernel-module: trial 08: 309983562 74097397 98796597
[902576.403550] kernel-module: trial 09: 309983562 74097397 98796597
[902576.403551] kernel-module: trial 10: 309983562 74097397 98796597
[902576.403552] kernel-module: trial 11: 309983562 74097397 98796597
[902576.403554] kernel-module: trial 12: 309983562 74097397 98796597
[902576.403555] kernel-module: trial 13: 309849076 74097403 98796630 <-- min
[902576.403557] kernel-module: trial 14: 309983562 74097397 98796597
[902576.403558] kernel-module: min time: 309849076
[902576.403559] kernel-module: max time: 309983568
[902576.403560] kernel-module: diff: 134492
对于每个试验,报告的值为:周期数(0x11),L1D访问次数(0x04),L1I访问次数(0x14)。我正在使用this ARM PMU reference的11.8节。
最佳答案
在最近的Linux内核中,自动NUMA页迁移机制会定期删除TLB条目,以便它可以监视NUMA局部性。即使数据保留在L1DCache中,TLB重装也会减慢O0代码的速度。
页面迁移机制不应在内核页面上激活。
您检查是否启用了自动NUMA页面迁移
$ cat /proc/sys/kernel/numa_balancing
你可以用禁用它
$ echo 0 > /proc/sys/kernel/numa_balancing
关于performance - 是什么导致Cortex-A72上带有-O0而不是-O3的简单紧密循环的周期如此高的变化?,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/59349267/