Question

I'm debugging some assembly code, and after reading some documentation, I'm not sure I understand constraints 100%. I was wondering if someone could set me straight. If I have the following code (arm32):

int foo(int in1, int *ptr1) {
    int out1=123;

    asm volatile (
        "   cmp     %[in1],  #0;"
        "   bne     1f;"
        "   dmb;"
        "   mov     %[out1], #0;"
        "1: strex   %[in1], [%[ptr1]];"
        : [out1]"=r"(out1), [ptr1]"+r"(ptr1)
        : [in1]"r"(in1)
        : "memory" );

    return out1;
}

I'm unclear of a few things: First, I mark out1 as being an output, but it is only an output if in1 is zero. I'm worried that =r constraint is being interpreted as 'this value is always set', telling the optimizer that any previous value is irrelevant. Of course, I'm not sure how I would write a constraint for something that might change...

I'm also concerned with ptr1. The pointer itself is not actually set, but what it points to is. I'm wondering if this should have a read constraint, and wondering if there's a proper way to set this constraint.

Note, I am using this code on multiple compilers (gcc, and clang, and various versions of each), so I'd like to avoid any assumptions about specific optimizers.

Recommended answer

That's correct, "=r" means write-only. The register is dead on input. The compiler won't bother putting anything specific in the selected register before the asm, because it's going to be overwritten. The compiler will optimize as if you had written out1 = asm_result; outside the inline asm.

"+r" is an input/output operand. If it might be modified, you need the compiler to assume that it always has been.

Look at the compiler-generated asm for the function, e.g. on the Godbolt compiler explorer (https://godbolt.org/). You can see what code the compiler generates around your inline asm, including after inlining into another function.
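
For out1, that means a read-write operand. Here is a minimal sketch, assuming GNU C inline asm (gcc or clang). The name maybe_zero is made up for illustration, and the x86 and plain-C branches are additions so the snippet also builds off ARM. The ARM branch assumes ARM or Thumb-2 mode. The dmb and the store are left out to keep the focus on the constraint:

```c
/* Sketch only: maybe_zero is a hypothetical helper, not code from the question.
   out1 is written only when in1 == 0, so it is declared "+r" (read-write):
   the compiler must load the old value into the register before the asm,
   and that value survives on the path that skips the mov. */
static inline int maybe_zero(int in1, int out1)
{
#if defined(__GNUC__) && defined(__arm__)
    __asm__ (
        "cmp   %[in1], #0\n\t"
        "bne   1f\n\t"              /* in1 != 0: leave out1 untouched */
        "mov   %[out1], #0\n"
        "1:"
        : [out1] "+r" (out1)
        : [in1] "r" (in1)
        : "cc");                    /* cmp modifies the flags */
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ (
        "testl %[in1], %[in1]\n\t"
        "jne   1f\n\t"              /* in1 != 0: leave out1 untouched */
        "movl  $0, %[out1]\n"
        "1:"
        : [out1] "+r" (out1)
        : [in1] "r" (in1)
        : "cc");
#else
    if (in1 == 0)                   /* portable fallback, no asm */
        out1 = 0;
#endif
    return out1;
}
```

With "=r" instead of "+r", the compiler would be free to skip initializing the register, so the in1 != 0 path could return garbage rather than the caller's value.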

Quoting the question: "I'm also concerned with ptr1. The pointer itself is not actually set, but what it points to is."

Yes, you are right to be concerned. "+r"(ptr1) tells the compiler the pointer value is modified, but does not imply that the pointed-to value is modified. The "memory" clobber is a heavy way to do that, or as Jester says, you should just use an "=m"(*ptr1) constraint instead to let the compiler pick an addressing mode, and tell it that the pointed-to memory is unconditionally written.
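
As a rough illustration of the "=m" approach, here is a sketch assuming GNU C inline asm. The name store_word is hypothetical, a plain str stands in for whatever store instruction is really wanted, and the x86 and plain-C branches are additions so the snippet builds on other targets:

```c
/* Sketch only: store_word is a hypothetical helper.
   "=m"(*ptr) makes the pointed-to int a write-only memory output, so the
   compiler knows exactly which object the asm writes and picks the
   addressing mode itself. No "memory" clobber is needed for this store. */
static inline void store_word(int *ptr, int value)
{
#if defined(__GNUC__) && defined(__arm__)
    __asm__ ("str  %[val], %[mem]"
             : [mem] "=m" (*ptr)
             : [val] "r" (value));
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ ("movl %[val], %[mem]"
             : [mem] "=m" (*ptr)
             : [val] "r" (value));
#else
    *ptr = value;                   /* portable fallback, no asm */
#endif
}
```

The pointer itself then needs no operand of its own, because the compiler substitutes a suitable address expression for %[mem].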

Does STREX even make sense without a preceding LDREX? I don't think so, but if I'm wrong then you only need inline asm for that one instruction, because ARM compilers just use plain str even for atomic stores.

If this function does the second half of an LL/SC pair, then that's pretty weird.

Are you sure you can't do what you want with the built-in __atomic_store_n(ptr1, value, __ATOMIC_RELAXED) plus an optional barrier, or a C11 atomic_store_explicit?

#include <stdatomic.h>
int foo(int in1, int *ptr1) {
    int out1=123;

    if (in1 != 0) {
        out1 = 0;
        //asm("dmb" ::: "memory");
        atomic_thread_fence(memory_order_release);  // make the following stores release-stores wrt. earlier operations
    }
    atomic_store_explicit((_Atomic int*)ptr1, in1, memory_order_relaxed);
    return out1;
}

Compiles with gcc6.3, on the Godbolt compiler explorer:

@ gcc6.3 -O3 -mcpu=cortex-a53  (ARM mode)
foo:
        subs    r3, r0, #0    @ copy in1 and set flags from it at the same time
        moveq   r0, #123      @ missed-optimization: since we still branch, no point hoisting this out of the if with predication
        bne     .L5
        str     r3, [r1]      @ if()-not-taken path
        bx      lr
.L5:
        dmb     ish           @ if()-taken path
        mov     r0, #0        @ makes the moveq doubly silly, because we do it again inside the branch.
        str     r3, [r1]
        bx      lr          @ out1 return value in r0

So it runs the same instructions as your implementation (except str instead of strex), but it branches differently, using tail duplication and probably saves instructions overall (with maybe larger code-size but lower dynamic instruction count, because we used -O3.) With -Os, we get very compact asm that's more like your inline-asm (jumping over a mov and a dmb).

Clang makes the whole thing branchless, using an itte (in thumb mode) to predicate the dmbne sy. (See its output on Godbolt.)

Note that a separate barrier is typically less efficient if you want to port this to AArch64. You want the compiler to be able to use AArch64's stlr release store (even though it's a sequential-release, not a weaker plain release). dmb ish is a full memory barrier. Also, 32-bit code for ARMv8 can use stl.

Note that a full dmb will order other later stores wrt. earlier stores, so this isn't exactly equivalent on AArch64 (or 32-bit with ARMv8 instructions available), where compiler-generated code doesn't use a dmb.

The version below compiles to pretty nice asm for all architectures. One missed optimization I see is that compilers don't manage to separate the dmb from the str, leaving one common str after a conditional dmb (for cases where they have to use dmb).

// recommended version
int foo_ifelse(int in1, int *ptr1) {
    int out1=123;
    if (in1 != 0) {
        out1 = 0;
        atomic_store_explicit((_Atomic int*)ptr1, in1, memory_order_release);
    } else {
        atomic_store_explicit((_Atomic int*)ptr1, in1, memory_order_relaxed);
    }
    return out1;
}

AArch64 gcc6.3 -O3 output:

foo_ifelse:
    cbnz    w0, .L9       @ compare-and-branch-non-zero
    str     wzr, [x1]     @ plain (relaxed) store
    mov     w0, 123
    ret
.L9:
    stlr    w0, [x1]      @ release-store
    mov     w0, 0
    ret

You could make the order parameter a variable as a way to simplify your source, but gcc does a very bad job with it. (clang turns it back into a branch). GCC strengthens it to seq_cst, even though the only 2 options in this case are relaxed and release.

// don't do this, gcc just strengthens a variable order to seq_cst
int foo_variable_order(int in1, int *ptr1) {
    int out1=123;
    memory_order order = memory_order_relaxed;

    if (in1 != 0) {
        out1 = 0;
        order = memory_order_release;
    }
    // SLOW AND INEFFICIENT with gcc
    // but clang distributes it over the branch
    atomic_store_explicit((_Atomic int*)ptr1, in1, order);
    return out1;
}

A non-constant order requires branching in the asm, or strengthening to the maximum.
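
To make "branching" concrete at the source level, here is a sketch that dispatches a runtime order to constant-order stores. The helper name store_with_order is made up for illustration, and mapping every other order to seq_cst is a conservative choice for the sketch, not something the compilers are documented to do:

```c
#include <stdatomic.h>

/* Sketch only: store_with_order is a hypothetical helper.
   Each case passes a compile-time-constant order, so the compiler can
   emit the cheapest store for that case instead of always using seq_cst. */
static inline void store_with_order(_Atomic int *p, int value, memory_order order)
{
    switch (order) {
    case memory_order_relaxed:
        atomic_store_explicit(p, value, memory_order_relaxed);
        break;
    case memory_order_release:
        atomic_store_explicit(p, value, memory_order_release);
        break;
    default:
        /* anything else: fall back to the strongest store ordering */
        atomic_store_explicit(p, value, memory_order_seq_cst);
        break;
    }
}
```

This is the same shape as foo_ifelse, which branches by hand and uses a constant order in each arm.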

We can really see the effect of over-strengthening on x86, where gcc uses mfence for this, but only plain mov for the others (which has release semantics in x86 asm). Also in ARM32 gcc output, where we see dmb before and after the store, for seq-cst instead of just release.

@ gcc6.3 -Os -mcpu=cortex-m4 -mthumb
foo_variable_order:
    dmb     ish
    str     r0, [r1]
    dmb     ish             @ barrier after for seq-cst

    cmp     r0, #0
    ite     eq              @ branchless out1 = in1 ? 0 : 123
    moveq   r0, #123
    movne   r0, #0
    bx      lr
