Is mfence needed for rdtsc on the x86_64 platform?

Question

unsigned int lo = 0;
unsigned int hi = 0;
__asm__ __volatile__ (
    "mfence;rdtsc" : "=a"(lo), "=d"(hi) : : "memory"
);

Is the mfence in the code above necessary?

In my tests I did not observe any CPU reordering.

The test code snippet is as follows.

inline uint64_t clock_cycles() {
    unsigned int lo = 0;
    unsigned int hi = 0;
    __asm__ __volatile__ (
        "rdtsc" : "=a"(lo), "=d"(hi)
    );
    return ((uint64_t)hi << 32) | lo;
}

uint64_t t1 = clock_cycles();  /* keep all 64 bits; unsigned would truncate the counter */
uint64_t t2 = clock_cycles();
assert(t2 > t1);

Recommended answer

What you need to perform a sensible measurement with rdtsc is a serializing instruction.

As is well known, a lot of people use cpuid before rdtsc.
However, rdtsc needs to be serialized from above and below (read: all instructions before it must have retired, and it must itself retire before the test code starts).

Unfortunately, the second condition is often neglected, because cpuid is a very bad choice for this task (it clobbers the output of rdtsc).
When looking for alternatives, people assume that any instruction with "fence" in its name will do, but this is also untrue. Straight from Intel:

MFENCE does not serialize the instruction stream.

An instruction that almost serializes the instruction stream is lfence.

Simply put, lfence makes sure that no new instruction starts before all prior instructions have completed locally. See this answer of mine for a more detailed explanation of locality.
It also doesn't drain the store buffer like mfence does, and it doesn't clobber registers like cpuid does.
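To make the register clobbering concrete, here is a minimal sketch of the common cpuid-then-rdtsc pattern. It assumes GCC or Clang inline assembly on x86_64, and the names `rdtsc_after_cpuid` and `check_cpuid_rdtsc` are made up for illustration. Because cpuid writes eax, ebx, ecx and edx, it can only sit before rdtsc (placing it after would overwrite the timestamp), and the extra registers have to be declared as clobbers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: serialize with cpuid, then read the TSC.
 * cpuid writes eax, ebx, ecx and edx, so rbx and rcx are listed as clobbers;
 * eax and edx are overwritten again by rdtsc and returned as outputs. */
static inline uint64_t rdtsc_after_cpuid(void) {
    uint32_t lo, hi;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"   /* select cpuid leaf 0 */
        "cpuid\n\t"               /* serializing, but overwrites four registers */
        "rdtsc"
        : "=a"(lo), "=d"(hi)
        :
        : "rbx", "rcx", "cc", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical self-check: two readings taken one after the other. */
static inline void check_cpuid_rdtsc(void) {
    uint64_t a = rdtsc_after_cpuid();
    uint64_t b = rdtsc_after_cpuid();
    assert(b >= a);
}
```

This only covers serialization from above; nothing stops the code under test from starting before rdtsc has retired.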

So lfence / rdtsc / lfence is a better-crafted sequence of instructions than mfence / rdtsc, where mfence is pretty much useless unless you explicitly want the previous stores to be completed before the test begins/ends (but not before rdtsc is executed!).
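As a rough illustration of the lfence / rdtsc / lfence sequence, the following sketch wraps it in a helper and uses it to time a small loop. It assumes GCC or Clang inline assembly on x86_64. The names `rdtsc_fenced` and `ticks_for_sum` are hypothetical and not part of the original answer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: read the TSC with an lfence on each side, so that
 * earlier instructions have completed locally before rdtsc runs and later
 * instructions do not start until rdtsc has completed. */
static inline uint64_t rdtsc_fenced(void) {
    uint32_t lo, hi;
    __asm__ __volatile__(
        "lfence\n\t"
        "rdtsc\n\t"
        "lfence"
        : "=a"(lo), "=d"(hi)
        :
        : "memory");              /* also keeps the compiler from moving memory accesses across */
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical usage: TSC ticks spent summing n values; the sum goes to *out. */
static inline uint64_t ticks_for_sum(const uint32_t *data, uint64_t n, uint32_t *out) {
    uint64_t start = rdtsc_fenced();
    uint32_t sum = 0;
    for (uint64_t i = 0; i < n; i++) {
        sum += data[i];
    }
    *out = sum;
    uint64_t end = rdtsc_fenced();
    assert(end >= start);
    return end - start;
}
```

The result is in TSC ticks, which on many CPUs advance at a fixed reference rate rather than at the current core clock, so treat it as a relative measure.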

If your test to detect reordering is assert(t2 > t1), then I believe you are testing nothing.
Leaving aside the return and the call, which may or may not prevent the CPU from seeing the second rdtsc in time for a reorder, it is unlikely (though possible!) that the CPU will reorder two rdtsc instructions even if one comes right after the other.

Imagine we have an rdtsc2 that is exactly like rdtsc but writes ecx:ebx.

Executing

rdtsc
rdtsc2

it is highly likely that ecx:ebx > edx:eax, because the CPU has no reason to execute rdtsc2 before rdtsc.
Reordering doesn't mean random ordering; it means looking for another instruction to execute when the current one cannot be executed.
But rdtsc has no dependency on any previous instruction, so it is unlikely to be delayed when the OoO core encounters it.
However, peculiar internal micro-architectural details may invalidate my thesis, hence the word "likely" in my previous statement.

We don't actually need this altered instruction, since register renaming achieves the same effect, but in case you are not familiar with register renaming, the thought experiment helps.
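The thought experiment can be approximated with real instructions. The sketch below (GCC or Clang inline assembly on x86_64 assumed) issues two rdtsc instructions in a single asm statement, with no call or return between them, and copies the first result into other registers to stand in for the imaginary rdtsc2 destination. The names `tsc_pair`, `rdtsc_twice` and `check_rdtsc_pair` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Two back-to-back TSC readings. */
struct tsc_pair {
    uint64_t first;
    uint64_t second;
};

/* Hypothetical helper: two rdtsc instructions in a single asm statement.
 * The first result is copied out of edx:eax before the second rdtsc
 * overwrites those registers. */
static inline struct tsc_pair rdtsc_twice(void) {
    uint32_t lo1, hi1, lo2, hi2;
    __asm__ __volatile__(
        "rdtsc\n\t"
        "movl %%eax, %0\n\t"
        "movl %%edx, %1\n\t"
        "rdtsc"
        : "=&r"(lo1), "=&r"(hi1), "=a"(lo2), "=d"(hi2));
    struct tsc_pair p;
    p.first = ((uint64_t)hi1 << 32) | lo1;
    p.second = ((uint64_t)hi2 << 32) | lo2;
    return p;
}

/* Hypothetical self-check mirroring the question's assert on a back-to-back pair. */
static inline void check_rdtsc_pair(void) {
    struct tsc_pair p = rdtsc_twice();
    assert(p.second >= p.first);  /* expected in practice, not architecturally guaranteed */
}
```

A passing check shows only that this CPU did not reorder this particular pair. It says nothing about how rdtsc is ordered relative to the surrounding code being measured.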

