Problem description
I'm trying to calculate the number of CPU cycles required to run a single ASM instruction. To do this, I've created this function:
measure_register_op:
# Calculate the time required for a movl operation
# function setup
pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %edi
xor %edi, %edi
# first time measurement
xorl %eax, %eax
cpuid # serialize execution before reading the TSC
rdtsc # result in edx:eax
# the instruction we are measuring is the one below
movl %eax, %edi
# second time measurement
cpuid # serialize execution again
rdtsc # result in edx:eax
# time difference
sub %eax, %edi
# move to EAX. Value of EAX is what function returns
movl %edi, %eax
# End of function
popl %edi
popl %ebx
mov %ebp, %esp
popl %ebp
ret
I'm using it in a *.c file:
#include <stdio.h>

extern unsigned int measure_register_op(void);
int main(void)
{
for (int a = 0; a < 10; a++)
{
printf("Instruction took %u cycles \n", measure_register_op());
}
return 0;
}
The problem is that the values I see are far too large. I'm currently getting 3684414156. What could be going wrong here?
I changed from EBX to EDI, but the result is still similar. It has to be something with rdtsc itself. In the debugger I can see that the second measurement gives 0x7f61e078 and the first gives 0x42999940, which after subtraction still gives around 1019758392.
Here is my makefile. Maybe I'm compiling it incorrectly:
compile: measurement.s measurement.c
gcc -g measurement.s measurement.c -o ./build/measurement -m32
Here is the exact output I see:
Instruction took 4294966680 cycles
Instruction took 4294966696 cycles
Instruction took 4294966688 cycles
Instruction took 4294966672 cycles
Instruction took 4294966680 cycles
Instruction took 4294966688 cycles
Instruction took 4294966688 cycles
Instruction took 4294966696 cycles
Instruction took 4294966688 cycles
Instruction took 4294966680 cycles
Recommended answer
In your updated version, which no longer clobbers the start time (the bug @R. pointed out), this instruction:
sub %eax, %edi
is calculating start - end (in AT&T syntax, %edi = %edi - %eax). That is a negative number, i.e. a huge unsigned number just below 2^32. For example, the printed value 4294966680 is 2^32 - 616, so that run really measured 616 ticks. If you're going to use %u, get used to interpreting its output back to a bit pattern when debugging.
You want end - start.
And BTW, use lfence instead; it's significantly more efficient than cpuid. It's guaranteed to serialize instruction execution on Intel, and unlike a fully serializing instruction such as cpuid it does not flush the store buffer. It's also safe on AMD CPUs with Spectre mitigation enabled.
See also http://akaros.cs.berkeley.edu/lxr/akaros/kern/arch/x86/rdtsc_test.c for several different ways to serialize RDTSC and/or RDTSCP.
See also "Get CPU cycle count?" for more about RDTSC, especially the fact that it counts reference cycles, not core clock cycles, so idle and turbo frequency changes will affect your results.
Also, the cost of one instruction isn't one-dimensional. It's not particularly useful to time a single instruction with RDTSC like this. See "RDTSCP in NASM always returns the same value" for more about how to measure throughput, latency, and uop count for a single instruction.
RDTSC can be useful for timing a whole loop or a longer sequence of instructions, one that is larger than your CPU's out-of-order (OoO) execution window.