
Question

I'm trying to produce code (currently using clang++-3.8) that adds two numbers, each consisting of multiple machine words. To keep things simple for the moment I'm only adding 128-bit numbers, but I'd like to be able to generalise this.

First, some type definitions:

typedef unsigned long long unsigned_word;
typedef __uint128_t unsigned_128;

And a result type:

struct Result
{
  unsigned_word lo;
  unsigned_word hi;
};

The first function, f, takes two pairs of unsigned words and returns a result. As an intermediate step it packs each pair of 64-bit words into a 128-bit word before adding them, like so:

Result f (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
  Result x;
  unsigned_128 n1 = lo1 + (static_cast<unsigned_128>(hi1) << 64);
  unsigned_128 n2 = lo2 + (static_cast<unsigned_128>(hi2) << 64);
  unsigned_128 r1 = n1 + n2;
  x.lo = r1 & ((static_cast<unsigned_128>(1) << 64) - 1);
  x.hi = r1 >> 64;
  return x;
}

This actually gets inlined quite nicely, like so:

movq    8(%rsp), %rsi
movq    (%rsp), %rbx
addq    24(%rsp), %rsi
adcq    16(%rsp), %rbx

Now, instead, I've written a simpler function using the clang multi-precision primitives, as below:

static Result g (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
  Result x;
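  unsigned_word carryout;
  unsigned_word carry_hi;  // carry out of the high word; Result has no field for it, so it is dropped
  x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
  x.hi = __builtin_addcll(hi1, hi2, carryout, &carry_hi);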
  return x;
}

This produces the following assembly:

movq    24(%rsp), %rsi
movq    (%rsp), %rbx
addq    16(%rsp), %rbx
addq    8(%rsp), %rsi
adcq    $0, %rbx

In this case there's an extra add. Instead of doing an ordinary add on the lo-words and then an adc on the hi-words, it adds the hi-words, then adds the lo-words, then does an adc on the hi-word again with an argument of zero.

This may not look too bad, but when you try this with larger numbers (say 192-bit or 256-bit) you soon get a mess of ors and other instructions propagating the carries up the chain, instead of a simple chain of add, adc, adc, ... adc.

The multi-precision primitives seem to be doing a terrible job at exactly what they're intended to do.

So what I'm looking for is code that I could generalise to any length (no need to do it for me, just enough that I can work out how), for which clang produces additions as efficiently as it does with its built-in 128-bit type (which unfortunately I can't easily generalise). I presume this should just be a chain of adcs, but I welcome arguments and code showing that it should be something else.
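For reference, the operation being asked for can be written portably as below. This is only a sketch to pin down the semantics: the name add_words_portable and its signature are made up for illustration, words are assumed to be stored least significant first, and nothing here promises that a compiler will turn the loop into an add/adc chain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original post): adds the n-word numbers
// a and b, least significant word first, stores the sum in r and returns
// the carry out of the top word (0 or 1). r may alias a or b.
inline unsigned add_words_portable(std::uint64_t *r, const std::uint64_t *a,
                                   const std::uint64_t *b, std::size_t n)
{
    assert(r && a && b);
    unsigned carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint64_t s = a[i] + b[i];  // may wrap around
        const unsigned c1 = s < a[i];         // carry out of a[i] + b[i]
        const std::uint64_t t = s + carry;    // may wrap again
        const unsigned c2 = t < s;            // carry out of adding the carry-in
        r[i] = t;
        carry = c1 | c2;                      // the two cannot both be set
    }
    return carry;
}
```

A version like this is mainly useful as a baseline to check faster variants against.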

Answer

There is an intrinsic to do this: _addcarry_u64. However, only Visual Studio and ICC (at least VS 2013 and 2015, and ICC 13 and ICC 15) generate efficient code for it. Clang 3.7 and GCC 5.2 still don't produce efficient code with this intrinsic.

Clang additionally has a built-in which one would think does this, __builtin_addcll, but it does not produce efficient code either.

The reason Visual Studio handles this well is that it does not allow inline assembly in 64-bit mode, so the compiler has to provide a way to do this with an intrinsic (though Microsoft took their time implementing it).

Therefore, with Visual Studio use _addcarry_u64. With ICC use _addcarry_u64 or inline assembly. With Clang and GCC use inline assembly.
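As a rough sketch of the intrinsic route, the loop below chains _addcarry_u64 over n words. The helper name add_words_intrinsic, the word_t typedef and the HAVE_ADDCARRY_U64 macro are made up for this example, and the non-x86 branch is only a portable stand-in so the snippet builds elsewhere. Whether the loop becomes a clean adc chain depends on the compiler, as described above.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_ADDCARRY_U64 1  // hypothetical feature macro, local to this sketch
#endif

// unsigned long long matches the parameter type the intrinsic expects.
typedef unsigned long long word_t;

// Hypothetical helper: x += y over n words (least significant first),
// returning the carry out of the top word.
inline unsigned char add_words_intrinsic(word_t *x, const word_t *y, std::size_t n)
{
    assert(x && y);
    unsigned char c = 0;
    for (std::size_t i = 0; i < n; ++i) {
#ifdef HAVE_ADDCARRY_U64
        // Adds x[i] + y[i] + c, stores the low 64 bits, returns the carry.
        c = _addcarry_u64(c, x[i], y[i], &x[i]);
#else
        // Portable stand-in for targets without the intrinsic.
        const word_t s = x[i] + y[i];
        const unsigned char c1 = s < y[i];
        const word_t t = s + c;
        c = static_cast<unsigned char>(c1 | (t < s));
        x[i] = t;
#endif
    }
    return c;
}
```

Note that on Linux uint64_t is usually unsigned long, so passing a uint64_t* where the GCC and Clang headers expect unsigned long long* may not compile as C++; using unsigned long long directly sidesteps that.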

Note that since the Broadwell microarchitecture there are two new instructions, adcx and adox, which you can access with the _addcarryx_u64 intrinsic. Intel's documentation for these intrinsics used to differ from the assembly produced by the compiler, but it appears the documentation is correct now. However, Visual Studio still appears to produce only adcx with _addcarryx_u64, whereas ICC produces both adcx and adox with this intrinsic. But even though ICC produces both instructions, it does not produce optimal code (ICC 15), so inline assembly is still necessary.

Personally, I think the fact that a non-standard feature of C/C++, such as inline assembly or intrinsics, is required to do this is a weakness of C/C++, but others might disagree. The adc instruction has been in the x86 instruction set since 1979. I would not hold my breath waiting for C/C++ compilers to figure out reliably when you want adc. Sure, they can have built-in types such as __int128, but the moment you want a larger type that's not built in you have to use some non-standard C/C++ feature such as inline assembly or intrinsics.
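To illustrate how far the built-in type gets you, here is a sketch that uses the 128-bit type only as a wide accumulator, so the carry falls out of a shift instead of a comparison. It assumes a GCC- or Clang-style compiler that provides __uint128_t (this is not standard C++), and the helper name add_words_wide is invented for the example. It shows the semantics only and makes no claim about the code a given compiler generates for it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: r = a + b over n 64-bit words, least significant
// first, returning the carry out of the top word (0 or 1).
// Assumes the compiler provides the non-standard __uint128_t type.
inline std::uint64_t add_words_wide(std::uint64_t *r, const std::uint64_t *a,
                                    const std::uint64_t *b, std::size_t n)
{
    assert(r && a && b);
    __uint128_t acc = 0;  // holds the carry (0 or 1) between iterations
    for (std::size_t i = 0; i < n; ++i) {
        acc += static_cast<__uint128_t>(a[i]) + b[i];  // cannot overflow 128 bits
        r[i] = static_cast<std::uint64_t>(acc);         // low 64 bits are the sum word
        acc >>= 64;                                     // high bits are the next carry
    }
    return static_cast<std::uint64_t>(acc);
}
```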

As for inline assembly code to do this, I already posted a solution for 256-bit addition of eight 64-bit integers held in registers, in my answer to "multi-word addition using the carry flag".

Here is the code, reposted.

#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4) \
 __asm__ __volatile__ ( \
 "addq %[v1], %[u1] \n" \
 "adcq %[v2], %[u2] \n" \
 "adcq %[v3], %[u3] \n" \
 "adcq %[v4], %[u4] \n" \
 : [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4) \
 : [v1]  "r" (Y1), [v2]  "r" (Y2), [v3]  "r" (Y3), [v4]  "r" (Y4))
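One way the macro might be used is sketched below. The wrapper add256_in_regs is hypothetical (name and signature are not from the original post); it copies the four destination words into locals so the "+&r" operands can live in registers, and it assumes words are stored least significant first. The macro is repeated so the snippet stands alone, and the #else branch is a portable stand-in for compilers or targets where the x86-64 GNU-style asm is unavailable.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__x86_64__) && defined(__GNUC__)  // Clang and ICC on Linux define __GNUC__ too
// Macro repeated from the answer so this sketch is self-contained.
#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4) \
 __asm__ __volatile__ ( \
 "addq %[v1], %[u1] \n" \
 "adcq %[v2], %[u2] \n" \
 "adcq %[v3], %[u3] \n" \
 "adcq %[v4], %[u4] \n" \
 : [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4) \
 : [v1]  "r" (Y1), [v2]  "r" (Y2), [v3]  "r" (Y3), [v4]  "r" (Y4))
#endif

// Hypothetical wrapper: x += y for two 4-word (256-bit) numbers.
// The carry out of the top word is discarded, as in the macro itself.
inline void add256_in_regs(std::uint64_t *x, const std::uint64_t *y)
{
    assert(x && y);
#ifdef ADD256
    std::uint64_t x1 = x[0], x2 = x[1], x3 = x[2], x4 = x[3];
    ADD256(x1, x2, x3, x4, y[0], y[1], y[2], y[3]);
    x[0] = x1; x[1] = x2; x[2] = x3; x[3] = x4;
#else
    // Portable stand-in when the inline assembly cannot be used.
    unsigned carry = 0;
    for (int i = 0; i < 4; ++i) {
        const std::uint64_t s = x[i] + y[i];
        const unsigned c1 = s < y[i];
        x[i] = s + carry;
        carry = c1 | (x[i] < s);
    }
#endif
}
```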

If you want to explicitly load the values from memory, you can do something like this:

//uint64_t dst[4] = {1,1,1,1};
//uint64_t src[4] = {1,2,3,4};
asm (
     "movq (%[in]), %%rax\n"
     "addq %%rax, %[out]\n"
     "movq 8(%[in]), %%rax\n"
     "adcq %%rax, 8%[out]\n"
     "movq 16(%[in]), %%rax\n"
     "adcq %%rax, 16%[out]\n"
     "movq 24(%[in]), %%rax\n"
     "adcq %%rax, 24%[out]\n"
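     : [out] "+m" (dst)   // dst is read and written, so "+m" rather than "=m"
     : [in]"r" (src)
     : "%rax", "cc", "memory"   // flags are modified and src is read through a pointer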
     );

That produces nearly identical assembly to the following function in ICC:

void add256(uint256 *x, uint256 *y) {
    unsigned char c = 0;
    c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
    c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
    c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
        _addcarry_u64(c, x->x4, y->x4, &x->x4);
}

I have limited experience with GCC inline assembly (or inline assembly in general - I usually use an assembler such as NASM), so maybe there are better inline assembly solutions.

"So what I'm looking for is code that I could generalise to any length"

To answer this part of the question, here is another solution using template metaprogramming. I have used this same trick for loop unrolling. It produces optimal code with ICC. If Clang or GCC ever implement _addcarry_u64 efficiently, this would be a good general solution.

#include <x86intrin.h>
#include <inttypes.h>

#define LEN 4  // number of 64-bit words, e.g. 4 = 256-bit add, 3 = 192-bit add, ...

static unsigned char c = 0;

template<int START, int N>
struct Repeat {
    static void add (uint64_t *x, uint64_t *y) {
        c = _addcarry_u64(c, x[START], y[START], &x[START]);
        Repeat<START+1, N>::add(x,y);
    }
};

template<int N>
    struct Repeat<LEN, N> {
    static void add (uint64_t *x, uint64_t *y) {}
};


void sum_unroll(uint64_t *x, uint64_t *y) {
    Repeat<0,LEN>::add(x,y);
}

Assembly from ICC:

xorl      %r10d, %r10d                                  #12.13
movzbl    c(%rip), %eax                                 #12.13
cmpl      %eax, %r10d                                   #12.13
movq      (%rsi), %rdx                                  #12.13
adcq      %rdx, (%rdi)                                  #12.13
movq      8(%rsi), %rcx                                 #12.13
adcq      %rcx, 8(%rdi)                                 #12.13
movq      16(%rsi), %r8                                 #12.13
adcq      %r8, 16(%rdi)                                 #12.13
movq      24(%rsi), %r9                                 #12.13
adcq      %r9, 24(%rdi)                                 #12.13
setb      %r10b

Metaprogramming is a basic feature of assemblers, so it's too bad that C and C++ (except through template metaprogramming hacks) have no solution for this either (the D language does).
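A variation on the template trick is sketched below, with the carry passed down the recursion as a value instead of living in a file-scope variable, which keeps the function reentrant. The names AddUnrolled and add_unrolled are invented for this example, and the per-word step uses a portable comparison rather than _addcarry_u64 so the snippet builds anywhere. It shows the unrolling structure only and says nothing about which instructions a given compiler emits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical unrolled adder: step I adds word I, then recurses to I + 1,
// handing the carry along as an ordinary argument.
template <std::size_t I, std::size_t N>
struct AddUnrolled {
    static unsigned run(std::uint64_t *x, const std::uint64_t *y, unsigned carry)
    {
        const std::uint64_t s = x[I] + y[I];  // may wrap around
        const unsigned c1 = s < y[I];         // carry out of x[I] + y[I]
        const std::uint64_t t = s + carry;    // add the incoming carry
        x[I] = t;
        return AddUnrolled<I + 1, N>::run(x, y, c1 | (t < s));
    }
};

// End of the recursion: no words left, return the final carry.
template <std::size_t N>
struct AddUnrolled<N, N> {
    static unsigned run(std::uint64_t *, const std::uint64_t *, unsigned carry)
    {
        return carry;
    }
};

// x += y over N words (least significant first); returns the carry out.
template <std::size_t N>
unsigned add_unrolled(std::uint64_t *x, const std::uint64_t *y)
{
    assert(x && y);
    return AddUnrolled<0, N>::run(x, y, 0);
}
```

A 192-bit add would then be written as add_unrolled<3>(x, y).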

The inline assembly I used above that referenced memory was causing some problems inside a function. Here is a new version which seems to work better:

void foo(uint64_t *dst, uint64_t *src)
{
    __asm (
        "movq (%[in]), %%rax\n"
        "addq %%rax, (%[out])\n"
        "movq 8(%[in]), %%rax\n"
        "adcq %%rax, 8(%[out])\n"
        "movq 16(%[in]), %%rax\n"
        "adcq %%rax, 16(%[out])\n"   // adcq, not addq, so the carry keeps propagating
        "movq 24(%[in]), %%rax\n"
        "adcq %%rax, 24(%[out])\n"
        :
        : [in] "r" (src), [out] "r" (dst)
        : "%rax", "cc", "memory"     // memory is read and written through the pointers
    );
}
