Is there any way to compile Microsoft-style inline assembly code on Linux?

Question

As the title says, I'm wondering whether there is any way to compile Microsoft-style inline assembly code (as shown below) on a Linux OS (e.g. Ubuntu).

_asm{
    mov edi, A;
    ....
    EMMS;
}

The sample is part of some inline assembly code that compiles successfully on Windows 10 with the cl.exe compiler. Is there any way to compile it on Linux? Do I have to rewrite it in GNU C/C++ style (i.e. __asm__{;;;})?

Recommended answer

First of all, you should usually replace inline asm (with intrinsics or pure C) instead of porting it. See https://gcc.gnu.org/wiki/DontUseInlineAsm

clang -fasm-blocks is mostly compatible with MSVC's inefficient inline asm syntax. But it doesn't support returning a value by leaving it in EAX and then falling off the end of a non-void function.

So you have to write inline asm that puts the value in a named C variable and then return that variable, which typically leads to an extra store/reload and makes MSVC syntax even worse. (That's pretty bad unless you're writing a whole loop in asm that amortizes the store/reload overhead of getting data into and out of the asm block.) See "What is the difference between 'asm', '__asm' and '__asm__'?" for a comparison of how inefficient MSVC inline asm is when wrapping a single instruction. It's less dumb inside functions with stack args when those functions don't inline, but that only happens if you're already making things inefficient (e.g. using legacy 32-bit calling conventions and not using link-time optimization to inline small functions).

MSVC can substitute an immediate 1 for A when inlining into a caller, but clang can't. Both defeat constant propagation, but MSVC at least avoids bouncing constant inputs through a store/reload (as long as you only use the input with instructions that support an immediate source operand).

Clang accepts __asm, asm, or __asm__ to introduce an asm block. MSVC accepts __asm (two underscores, like clang) or _asm (more commonly used, but clang doesn't accept it).

So for existing MSVC code you probably want #define _asm __asm so your code can compile with both MSVC and clang, unless you need to make separate versions anyway. Alternatively, use clang -D_asm=asm to set the CPP macro on the command line.
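
As a rough sketch of the macro approach above, the shim below maps _asm to __asm only for non-MSVC clang. USE_ASM_BLOCKS and copy_via_asm are hypothetical names introduced for illustration; they are not part of clang or MSVC. The asm block is only compiled when you define that macro yourself (and, for clang, pass -fasm-blocks); otherwise a plain-C fallback is used.

```c
/* Compatibility shim (sketch). USE_ASM_BLOCKS is a hypothetical macro that
   you would define yourself, e.g.: clang -fasm-blocks -DUSE_ASM_BLOCKS ... */
#if defined(__clang__) && !defined(_MSC_VER)
#define _asm __asm   /* clang accepts __asm but not MSVC's _asm spelling */
#endif

/* Hypothetical wrapper: copies A through ECX when asm blocks are enabled. */
int copy_via_asm(int A) {
    int out;
#if defined(USE_ASM_BLOCKS)
    _asm {
        mov ecx, A      /* the compiler substitutes A's stack slot here */
        mov out, ecx    /* the result has to go through a named C variable */
    }
#else
    out = A;            /* plain-C fallback when asm blocks are not enabled */
#endif
    return out;
}
```

Keeping the fallback path means the same file still builds with compilers that have no asm-block support at all.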

(Don't forget to enable optimization: clang -fasm-blocks -O3 -march=native -flto -Wall. Omit or modify -march=native if you want a binary that can run on earlier or different CPUs than your compile host.)

int a_global;

inline
long foo(int A, int B, int *arr) {
    int out;
    // You can't assume A will be in RDI: after inlining it prob. won't be
    __asm {
        mov   ecx, A                   // comment syntax
        add   dword ptr [a_global], 1
        mov   out, ecx
    }
    return out;
}

Compiling with x86-64 Linux clang 8.0 on Godbolt shows that clang can inline the wrapper function containing the inline asm, and how much store/reload MSVC syntax entails (vs. GNU C inline asm, which can take inputs and outputs in registers).

I'm using clang in Intel-syntax asm output mode, but it also compiles Intel-syntax asm blocks when it's outputting in AT&T syntax mode. (Normally clang compiles straight to machine code anyway, which it also does correctly.)

## The x86-64 System V ABI passes args in rdi, rsi, rdx, ...
# clang -O3 -fasm-blocks -Wall
foo(int, int, int*):
        mov     dword ptr [rsp - 4], edi        # compiler-generated store of register arg to the stack

        mov     ecx, dword ptr [rsp - 4]        # start of inline asm
        add     dword ptr [rip + a_global], 1
        mov     dword ptr [rsp - 8], ecx        # end of inline asm

        movsxd  rax, dword ptr [rsp - 8]        # reload `out` with sign-extension to long (64-bit) : compiler-generated
        ret

Notice how the compiler substituted [rsp - 4] and [rsp - 8] for the C local variables A and out in the asm source block, and that a variable in static storage gets RIP-relative addressing. GNU C inline asm doesn't do this: you need to declare %[name] operands and tell the compiler where to put them.
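
For comparison, here is a minimal sketch of what the same wrapper might look like in GNU C inline asm with named operands. The name foo_gnu, the operand names, and the plain-C fallback for non-x86 targets are my own additions for illustration, and the template assumes the default AT&T syntax (it would not assemble with -masm=intel).

```c
int a_global;

/* Sketch of foo() using GNU C extended asm instead of an MSVC-style block. */
long foo_gnu(int A) {
    int out;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ ("movl %[in], %[out]\n\t"   /* AT&T order: source, then destination */
             "addl $1, %[glob]"         /* increment the global in memory */
             : [out] "=&r" (out),       /* early-clobber: written before the asm finishes */
               [glob] "+m" (a_global)   /* read-modify-write memory operand */
             : [in] "r" (A)             /* input in a register the compiler chooses */
             : "cc");                   /* addl modifies the flags */
#else
    out = A;                            /* fallback so the sketch builds elsewhere */
    a_global += 1;
#endif
    return out;
}
```

Because the input uses an "r" constraint, the compiler is free to keep A in a register rather than storing it to the stack first, which is the store/reload that the MSVC-style block forces.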

We can even see clang inline foo twice into one caller, and optimize away the sign-extension to 64-bit because caller only returns int.

int caller() {
    return foo(1, 2, nullptr) + foo(1, 2, nullptr);
}

caller():                             # @caller()
        mov     dword ptr [rsp - 4], 1

        mov     ecx, dword ptr [rsp - 4]      # first inline asm
        add     dword ptr [rip + a_global], 1
        mov     dword ptr [rsp - 8], ecx

        mov     eax, dword ptr [rsp - 8]     # compiler-generated reload
        mov     dword ptr [rsp - 4], 1       # and store of A=1 again

        mov     ecx, dword ptr [rsp - 4]      # second inline asm
        add     dword ptr [rip + a_global], 1
        mov     dword ptr [rsp - 8], ecx

        add     eax, dword ptr [rsp - 8]     # compiler-generated reload
        ret

So we can see that just reading A from inline asm creates a missed optimization: the compiler stores a 1 again even though the asm only read that input without modifying it.

I haven't done tests like assigning to or reading a_global before/between/after the asm statements to make sure the compiler "knows" that the variable is modified by the asm statement.

I also haven't tested passing a pointer into an asm block and looping over the pointed-to array, to see whether an asm block acts like a "memory" clobber in GNU C inline asm. I'd assume it does.

My Godbolt link also includes an example of falling off the end of a non-void function with a value in EAX. That's supported by MSVC, but for clang it is UB as usual and breaks when inlining into a caller. (Strangely, there's no warning, even at -Wall.) You can see how x86 MSVC compiles it on my Godbolt link above.

Porting MSVC asm to GNU C inline asm is almost certainly the wrong choice. Compiler support for optimizing intrinsics is very good, so you can usually get the compiler to generate good-quality, efficient asm for you.

If you're going to do anything to existing hand-written asm, replacing it with pure C will usually be the most efficient, and certainly the most future-proof, path forward. Code that can auto-vectorize to wider vectors in the future is always good. But if you do need to manually vectorize for some tricky shuffling, then intrinsics are the way to go unless the compiler makes a mess of it somehow.
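
To illustrate the pure-C route, here is a sketch of a loop of the kind MMX code was often hand-written for. The question doesn't show what the original asm computes, so the unsigned saturating byte add and the name add_sat_u8 are purely hypothetical stand-ins. A loop in this shape gives an optimizing compiler the chance to auto-vectorize it for whatever vector width the target supports, though whether it actually does depends on the compiler and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: dst[i] = min(a[i] + b[i], 255) for each byte.
   restrict promises the arrays don't overlap, which helps vectorization. */
void add_sat_u8(uint8_t *restrict dst, const uint8_t *restrict a,
                const uint8_t *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];      /* widen so the sum can't wrap */
        dst[i] = sum > 255 ? 255 : (uint8_t)sum;   /* clamp to the uint8_t range */
    }
}
```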

Look at the compiler-generated asm you get from intrinsics to make sure it's as good as or better than the original.

If you're using MMX / EMMS, now is probably a good time to replace your MMX code with SSE2 intrinsics. SSE2 is baseline for x86-64, and few Linux systems are running obsolete 32-bit kernels.
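
As a sketch of what an SSE2 replacement could look like, the function below does an unsigned saturating byte add 16 bytes at a time. The operation itself and the name add_sat_u8_sse2 are assumptions for illustration, since the question's MMX code is elided; the intrinsics used (_mm_loadu_si128, _mm_adds_epu8, _mm_storeu_si128) are the standard ones from emmintrin.h. SSE2 works on the XMM registers, which don't alias the x87 stack, so no EMMS is needed afterwards.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical example: dst[i] = min(a[i] + b[i], 255) for each byte. */
void add_sat_u8_sse2(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    size_t i = 0;
#if defined(__SSE2__)
    /* Main loop: 16 bytes per iteration, unaligned loads and stores. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
#endif
    /* Scalar tail; also the whole loop when SSE2 is unavailable. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = sum > 255 ? 255 : (uint8_t)sum;
    }
}
```

As with any intrinsics code, it's worth checking the generated asm, for example on Godbolt, before assuming it beats the original.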
