当双打混合使用int和无符号整型之间的速度差

当双打混合使用int和无符号整型之间的速度差

本文介绍了当双打混合使用int和无符号整型之间的速度差的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个应用程序,其中内环的一部分基本上是:

I have an application where part of the inner loop was basically:

double sum = 0;
for (int i = 0; i != N; ++i, ++data, ++x) sum += *data * x;

如果x是一个unsigned int,那么code需要3倍的时间与诠释!

If x is an unsigned int, then the code takes 3 times as long as with int!

这是一个较大的code-基础的一部分,但我得到了下来的要领:

This was part of a larger code-base, but I got it down to the essentials:

#include <iostream>
#include <cstdlib>
#include <vector>
#include <time.h>

typedef unsigned char uint8;

template<typename T>
double moments(const uint8* data, int N, T wrap) {
    T pos = 0;
    double sum = 0.;
    for (int i = 0; i != N; ++i, ++data) {
        sum += *data * pos;
        ++pos;
        if (pos == wrap) pos = 0;
    }
    return sum;
}

template<typename T>
const char* name() { return "unknown"; }

template<>
const char* name<int>() { return "int"; }

template<>
const char* name<unsigned int>() { return "unsigned int"; }

const int Nr_Samples = 10 * 1000;

template<typename T>
void measure(const std::vector<uint8>& data) {
    const uint8* dataptr = &data[0];
    double moments_results[Nr_Samples];
    time_t start, end;
    time(&start);
    for (int i = 0; i != Nr_Samples; ++i) {
        moments_results[i] = moments<T>(dataptr, data.size(), 128);
    }
    time(&end);
    double avg = 0.0;
    for (int i = 0; i != Nr_Samples; ++i) avg += moments_results[i];
    avg /= Nr_Samples;
    std::cout << "With " << name<T>() << ": " << avg << " in " << (end - start) << "secs" << std::endl;
}


int main() {
    std::vector<uint8> data(128*1024);
    for (int i = 0; i != data.size(); ++i) data[i] = std::rand();
    measure<int>(data);
    measure<unsigned int>(data);
    measure<int>(data);
    return 0;
}

没有优化的编译:

Compiling with no optimization:

luispedro@oakeshott:/home/luispedro/tmp/so §g++  test.cpp
luispedro@oakeshott:/home/luispedro/tmp/so §./a.out
With int: 1.06353e+09 in 9secs
With unsigned int: 1.06353e+09 in 14secs
With int: 1.06353e+09 in 9secs

通过优化:

luispedro@oakeshott:/home/luispedro/tmp/so §g++  -O3  test.cpp
luispedro@oakeshott:/home/luispedro/tmp/so §./a.out
With int: 1.06353e+09 in 3secs
With unsigned int: 1.06353e+09 in 12secs
With int: 1.06353e+09 in 4secs

我不明白为什么这样的速度大的差别。我试图从生成的程序集计算出来的,但我毫无进展。任何人有什么想法?

I don't understand why such a large difference in speed. I tried figuring it out from the generated assembly, but I got nowhere. Anyone have any thoughts?

这是什么做的硬件或者是gcc的优化机械的限制?我敢打赌第二。

Is this something to do with the hardware or is it a limitation of gcc's optimisation machinery? I'm betting the second.

我的机器是运行Ubuntu 9.10。英特尔32位

My machine is an Intel 32 bit running Ubuntu 9.10.

修改:由于斯蒂芬问,这里是去编译源(从-O3编译)。我相信,我得到了主循环:

Edit: Since Stephen asked, here is the de-compiled source (from a -O3 compilation). I believe I got the main loops:

INT版本:

40: 0f b6 14 0b             movzbl (%ebx,%ecx,1),%edx
     sum += *data * pos;
44: 0f b6 d2                movzbl %dl,%edx
47: 0f af d0                imul   %eax,%edx
      ++pos;
4a: 83 c0 01                add    $0x1,%eax
      sum += *data * pos;
4d: 89 95 54 c7 fe ff       mov    %edx,-0x138ac(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
53: 31 d2                   xor    %edx,%edx
55: 3d 80 00 00 00          cmp    $0x80,%eax
5a: 0f 94 c2                sete   %dl
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
5d: 83 c1 01                add    $0x1,%ecx
      sum += *data * pos;
60: db 85 54 c7 fe ff       fildl  -0x138ac(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
66: 83 ea 01                sub    $0x1,%edx
69: 21 d0                   and    %edx,%eax
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
6b: 39 f1                   cmp    %esi,%ecx
      sum += *data * pos;
6d: de c1                   faddp  %st,%st(1)
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
6f: 75 cf                   jne    40

签名版本:

50: 0f b6 34 13             movzbl (%ebx,%edx,1),%esi
      sum += *data * pos;
54: 81 e6 ff 00 00 00       and    $0xff,%esi
5a: 31 ff                   xor    %edi,%edi
5c: 0f af f0                imul   %eax,%esi
      ++pos;
5f: 83 c0 01                add    $0x1,%eax
      if (pos == wrap) pos = 0;
62: 3d 80 00 00 00          cmp    $0x80,%eax
67: 0f 94 c1                sete   %cl
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
6a: 83 c2 01                add    $0x1,%edx
      sum += *data * pos;
6d: 89 bd 54 c7 fe ff       mov    %edi,-0x138ac(%ebp)
73: 89 b5 50 c7 fe ff       mov    %esi,-0x138b0(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
79: 89 ce                   mov    %ecx,%esi
7b: 81 e6 ff 00 00 00       and    $0xff,%esi
      sum += *data * pos;
81: df ad 50 c7 fe ff       fildll -0x138b0(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
87: 83 ee 01                sub    $0x1,%esi
8a: 21 f0                   and    %esi,%eax
  for (int i = 0; i != N; ++i, ++data) {
8c: 3b 95 34 c7 fe ff       cmp    -0x138cc(%ebp),%edx
      sum += *data * pos;
92: de c1                   faddp  %st,%st(1)
  for (int i = 0; i != N; ++i, ++data) {
94: 75 ba                   jne    50

这是-O3版本,这就是为什么源代码行跳上跳下。
谢谢你。

This is the -O3 version, which is why the source lines jump up and down.Thank you.

推荐答案

下面的原因:许多常见的架构(包括x86)有一个硬件指令符号整数转换为双打,但不具有签名硬件转换翻一番,所以编译器需要合成在软件转换。此外,英特尔唯一的无符号乘法是一个全宽乘法,而签署乘法可以使用​​符号乘法指令低

Here's why: many common architectures (including x86) have a hardware instruction to convert signed int to doubles, but do not have a hardware conversion from unsigned to double, so the compiler needs to synthesize the conversion in software. Furthermore, the only unsigned multiply on Intel is a full width multiply, whereas signed multiplies can use the signed multiply low instruction.

从unsigned int类型翻一番很可能是次优的(几乎可以肯定是,因为你观察到的经济放缓的幅度),但预计为code行为要快用签字时,GCC的软件转换整数。

GCC's software conversion from unsigned int to double may very well be suboptimal (it almost certainly is, given the magnitude of the slowdown that you observed), but it is expected behavior for the code to be faster when using signed integers.

假设智能编译器,差应在64位的系统小得多,因为一个64位带符号整数。 - >双转换可以用来有效地做一个32位无符号转换

Assuming a smart compiler, the difference should be much smaller on a 64-bit system, because a 64-bit signed integer -> double conversion can be used to efficiently do a 32-bit unsigned conversion.

编辑:来说明,这一点:

sum += *data * x;

如果整型变量签字,应编译成沿着这些路线的内容:

if the integer variables are signed, should compile into something along these lines:

mov       (data),   %eax
imul      %ecx,     %eax
cvtsi2sd  %eax,     %xmm1
addsd     %xmm1,    %xmm0

在另一方面,如果整型变量是无符号, cvtsi2sd 不能用来做转换,因此需要一个软件解决办法。我希望看到这样的事情:

on the other hand, if the integer variables are unsigned, cvtsi2sd can't be used to do the conversion, so a software workaround is required. I would expect to see something like this:

    mov       (data),   %eax
    mul       %ecx            // might be slower than imul
    cvtsi2sd  %eax,     %xmm1 // convert as though signed integer
    test      %eax,     %eax  // check if high bit was set
    jge       1f              // if it was, we need to adjust the converted
    addsd     (2^32),   %xmm1 // value by adding 2^32
1:  addsd     %xmm1,    %xmm0

这将是无符号的可接受的codeGEN - >双转换;它可以很容易加重病情。

That would be "acceptable" codegen for the unsigned -> double conversion; it could easily be worse.

所有这一切都是假设浮点code一代SSE(我相信这是在Ubuntu工具的默认,但我可能是错的)。

All of this is assuming floating-point code generation to SSE (I believe this is the default on the Ubuntu tools, but I could be wrong).

这篇关于当双打混合使用int和无符号整型之间的速度差的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-24 08:17