本文介绍了为什么clock_gettime(CLOCK_REALTIME,..)上的呼叫延迟有如此大的差异?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想确定clock_gettime(CLOCK_REALTIME,...)通话需要多长时间. 过去"我曾经在循环的顶部调用过一次,因为这是一个相当昂贵的调用.但是现在,我希望通过vDSO和一些时钟改进,它可能不会这么慢.

我编写了一些测试代码,这些代码使用__rdtscp来定时对clock_gettime的重复调用(rdtscp调用绕过了调用clock_gettime的循环并将结果加在一起,只是为了避免编译器进行优化太远了.

如果我快速连续调用clock_gettime(),则时间长度从大约45,000个时钟周期下降到500个周期.我认为部分原因可能是第一次调用必须加载vDSO代码的原因(对我而言仍然不完整),但是如何花几个调用才能得到500个我根本无法解释,这种现象似乎不管我如何测试,它都是恒定不变的:

42467
1114
1077
496
455

但是,如果我在clock_gettime调用之间睡眠(一秒钟或十秒钟,没关系),它只会达到大约4.7k个周期的稳定状态:

这里有10秒钟的睡眠时间:

28293
1093
4729
4756
4736

这里有1秒钟的睡眠时间:

61578
855
4753
4741
5645
4753
4732

高速缓存的行为似乎无法描述这一点(在台式机系统上没有做任何事情).我应该为调用clock_gettime安排多少预算?为什么通话变得越来越快?为什么睡一小段时间这么重要?

tl; dr 我想了解调用clock_gettime(CLOCK_REALTIME,...)所花费的时间,不了解为什么快速连续调用而不是两次调用之间运行更快. /p>

更新:这是proc 0上的cpuinfo

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 158
model name  : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping    : 9
microcode   : 0x84
cpu MHz     : 2800.000
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs        :
bogomips    : 5616.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

这是重新创建的测试代码:

#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

// compiled gcc -Wall -O3 -o clockt clockt.cpp
// called glockt sleeptime trials loops

unsigned long long now() {
    struct timespec s;
    clock_gettime(CLOCK_REALTIME, &s);
    return (s.tv_sec * 1000000000ull) + s.tv_nsec;
}

int main(int argc, char **argv) {
    int sleeptime = atoi(argv[1]);
    int trials = atoi(argv[2]);
    int loops = atoi(argv[3]);

    unsigned long long x, y, n = 0;
    unsigned int d;


    x = __rdtscp(&d);
    n = now();
    asm volatile("": "+r" (n));
    y = __rdtscp(&d);

    printf("init run %lld\n", (y-x));

    for(int t = 0; t < trials; ++t) {
        if(sleeptime > 0) sleep(sleeptime);
        x = __rdtscp(&d);
        for(int l = 0; l < loops; ++l) {
            n = now();
            asm volatile("": "+r" (n));
        }
        y = __rdtscp(&d);
        printf("trial %d took %lld\n", t, (y-x));
    }

    exit(0);
}
解决方案

第一次调用clock_gettime时,页面错误发生在包含该功能指令的页面上.在我的系统上,这是一个软页面错误,需要花费数千个周期(最多10,000个周期)来处理.我的CPU运行在3.4GHz.我认为您的CPU运行频率要低得多,因此处理系统上的页面错误会花费更多时间.但是这里的要点是,第一次调用clock_gettime所花的时间比以后的调用要多得多,这就是您所观察到的.

您的代码所表现出的第二个主要影响是由于指令高速缓存未命中而造成的严重停顿.似乎您只在调用两个函数,即nowprintf,但是这些函数调用其他函数,并且它们都在L1指令高速缓存上竞争.总的来说,这取决于如何在物理地址空间中对齐所有这些功能.当睡眠时间为零秒时,由于指令高速缓存未命中而导致的停顿时间实际上相对较小(您可以使用ICACHE.IFETCH_STALL性能计数器进行测量).但是,当睡眠时间大于零秒时,由于操作系统将安排其他一些线程在同一内核上运行,并且该线程将使用不同的指令和数据,因此该停顿时间会大大增加.这解释了为什么在睡眠时clock_gettime需要更多时间执行.

现在有关第二次和以后的测量.从问题开始:

42467
1114
1077
496
455

我在系统上观察到第二个度量不一定比以后的度量大.我相信在您的系统上也是如此.实际上,当您睡眠10秒钟或1秒时,似乎就是这种情况.在外部循环中,两个函数nowprintf包含数千条动态指令,它们还访问L1数据高速缓存.您在第二次和以后的测量之间看到的变异性是可重现的.因此它是函数本身固有的.注意,rdtscp指令本身的执行时间可能会变化4个周期.另请参见.

实际上,当所需的精度最多为一百万个周期时,clock_gettime很有用.否则,可能会产生误导.

I'm trying to time how long clock_gettime(CLOCK_REALTIME,...) takes to call. "Back in the day" I used to call it once at the top of a loop since it was a fairly expensive call. But now, I was hoping that with vDSO and some clock improvements, it might not be so slow.

I wrote some test code that used __rdtscp to time repeated calls to clock_gettime (the rdtscp calls went around a loop that called clock_gettime and added the results together, just so the compiler wouldn't optimize too much away).

If I call clock_gettime() in quick succession, the length of time goes from about 45k clock cycles down to 500 cycles. Some of this I thought could be contributed to the first call having to load the vDSO code (still doesn't full make sense to me), but how it takes a few calls to get 500 I cannot explain at all, and this behavior seems to be constant regardless of how I test it:

42467
1114
1077
496
455

However, if I sleep (for a second or ten, doesn't matter) between calls to clock_gettime, it only reaches a steady state of about 4.7k cycles:

Here at 10 second sleeps:

28293
1093
4729
4756
4736

Here at 1 second sleeps:

61578
855
4753
4741
5645
4753
4732

Cache behavior wouldn't seem to describe this (on a desktop system not doing much of anything). How much should I budget for a call to clock_gettime? Why does it get progressively faster to call? Why does sleeping a small amount of time matter so much?

tl;dr I'm trying to understand the time it takes to call clock_gettime(CLOCK_REALTIME,...) don't understand why it runs faster when called in quick succession as opposed to a second between calls.

Update: Here's the cpuinfo on proc 0

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 158
model name  : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping    : 9
microcode   : 0x84
cpu MHz     : 2800.000
cache size  : 6144 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs        :
bogomips    : 5616.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

Here is the recreated test code:

#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

// compiled gcc -Wall -O3 -o clockt clockt.cpp
// called glockt sleeptime trials loops

unsigned long long now() {
    struct timespec s;
    clock_gettime(CLOCK_REALTIME, &s);
    return (s.tv_sec * 1000000000ull) + s.tv_nsec;
}

int main(int argc, char **argv) {
    int sleeptime = atoi(argv[1]);
    int trials = atoi(argv[2]);
    int loops = atoi(argv[3]);

    unsigned long long x, y, n = 0;
    unsigned int d;


    x = __rdtscp(&d);
    n = now();
    asm volatile("": "+r" (n));
    y = __rdtscp(&d);

    printf("init run %lld\n", (y-x));

    for(int t = 0; t < trials; ++t) {
        if(sleeptime > 0) sleep(sleeptime);
        x = __rdtscp(&d);
        for(int l = 0; l < loops; ++l) {
            n = now();
            asm volatile("": "+r" (n));
        }
        y = __rdtscp(&d);
        printf("trial %d took %lld\n", t, (y-x));
    }

    exit(0);
}
解决方案

The very first time clock_gettime is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousands of cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime will take much more time than later calls, which is what you are observing.

The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now and printf, but these functions call other functions and they all compete on the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core and that thread would different instructions and data. This explains why when you sleep, clock_gettime takes more time to execute.

Now regarding the second and later measurements. From the question:

42467
1114
1077
496
455

I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now and printf contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible. So it's inherent in the functions themselves. Note that the execution time of rdtscp instruction itself may vary by 4 cycles. See also this.

In practice, the clock_gettime is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.

这篇关于为什么clock_gettime(CLOCK_REALTIME,..)上的呼叫延迟有如此大的差异?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-01 15:02