c++ - 64位平台上的效率: pointer vs 32-bit array indexing

在他的主题演讲之一中，Andrei Alexandrescu建议，在64位平台上，使用32位数组索引比使用原始指针快:

第16页:http://www.slideshare.net/andreialexandrescu1/three-optimization-tips-for-c-15708507

在他的Facebook帐户上，他更加精确地说:“更喜欢将数组索引设置为指针(此指针似乎每十年就会反转一次)”。

我已经尝试了很多方法来找到不同之处，但是我还没有设法构建任何能够显示出这种不同之处的程序。认识安德烈(Andrei)，差异不会超过百分之几，我不会感到惊讶，但是如果有人找到这样的例子，我将感到高兴。

这是我做过的测试。我选择n = 5000，足够大以获得适当的计时，并且足够小以使所有内容都适合L1缓存。我循环几次，以使CPU频率上升。

#include <iostream>
#include <chrono>

int main(int argc, const char* argv[]) {
  const int n{5000};
  int* p{new int[n]};

  // Warm up the cache
  for (int i{0}; i < n; i++) {
    p[i] += 1;
  }

  for (int j{0}; j < 5; j++) {
    {
      auto start_pointer = std::chrono::high_resolution_clock::now();
      for (int* q{p}; q != p + n; ++q) {
        ++(*q);
      }
      auto end_pointer = std::chrono::high_resolution_clock::now();
      auto time_pointer = std::chrono::duration_cast<std::chrono::nanoseconds>(
                              end_pointer - start_pointer)
                              .count();
      std::cout << " Pointer: " << time_pointer << std::endl;
    }

    {
      auto start_pointer = std::chrono::high_resolution_clock::now();
      for (int* q{p}; q != p + n; ++q) {
        ++(*q);
      }
      auto end_pointer = std::chrono::high_resolution_clock::now();
      auto time_pointer = std::chrono::duration_cast<std::chrono::nanoseconds>(
                              end_pointer - start_pointer)
                              .count();
      std::cout << " Pointer: " << time_pointer << std::endl;
    }

    {
      auto start_index_32 = std::chrono::high_resolution_clock::now();
      for (int i{0}; i < n; i++) {
        p[i] += 1;
      }
      auto end_index_32 = std::chrono::high_resolution_clock::now();
      auto time_index_32 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                               end_index_32 - start_index_32)
                               .count();
      std::cout << "Index 32: " << time_index_32 << std::endl;
    }

    {
      auto start_index_32 = std::chrono::high_resolution_clock::now();
      for (int i{0}; i < n; i++) {
        p[i] += 1;
      }
      auto end_index_32 = std::chrono::high_resolution_clock::now();
      auto time_index_32 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                               end_index_32 - start_index_32)
                               .count();
      std::cout << "Index 32: " << time_index_32 << std::endl;
    }

    {
      const std::size_t n_64{n};
      auto start_index_64 = std::chrono::high_resolution_clock::now();
      for (std::size_t i{0}; i < n_64; i++) {
        p[i] += 1;
      }
      auto end_index_64 = std::chrono::high_resolution_clock::now();
      auto time_index_64 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                               end_index_64 - start_index_64)
                               .count();
      std::cout << "Index 64: " << time_index_64 << std::endl;
    }

    {
      const std::size_t n_64{n};
      auto start_index_64 = std::chrono::high_resolution_clock::now();
      for (std::size_t i{0}; i < n_64; i++) {
        p[i] += 1;
      }
      auto end_index_64 = std::chrono::high_resolution_clock::now();
      auto time_index_64 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                               end_index_64 - start_index_64)
                               .count();
      std::cout << "Index 64: " << time_index_64 << std::endl;
    }
    std::cout << std::endl;
  }

  delete[] p;

  return 0;
}

这是我在计算机(核心i7)上得到的那种结果。我得到的只是“噪音”。

 Pointer: 883
 Pointer: 485
Index 32: 436
Index 32: 380
Index 64: 372
Index 64: 429

 Pointer: 330
 Pointer: 316
Index 32: 336
Index 32: 321
Index 64: 337
Index 64: 318

 Pointer: 311
 Pointer: 314
Index 32: 318
Index 32: 319
Index 64: 316
Index 64: 301

 Pointer: 306
 Pointer: 325
Index 32: 323
Index 32: 313
Index 64: 318
Index 64: 305

 Pointer: 311
 Pointer: 319
Index 32: 313
Index 32: 324
Index 64: 315
Index 64: 303

最佳答案

这样的低级建议(甚至来自Andrei Alexandrescu)的问题在于，它忽略了编译器优化的事实。

现代的编译器如此积极地进行优化(并且通常会成功地进行优化)，以至于试图对其进行第二次猜测就成了杯子的游戏。总体而言，编写清晰易读的代码将帮助您，您的同事和编译器分析代码。老实说，我认为这是可以提供的最佳一般建议。

现代编译器使用的著名优化之一是基于索引和基于指针的循环之间的转换。在基准测试的特定情况下，对于大多数优化设置，gcc会将基于指针的循环和基于32位索引的循环编译为相同的汇编器输出。

在下文中，我将chrono内容替换为++sentry，其中sentry是volatile，以减小代码大小。程序集输出对应于:

for (int* q{p}; q != p + n; ++q) ++(*q);
++sentry;
for (int i{0}; i < n; i++) p[i] += 1;

使用-O2进行编译，将产生以下结果:(%rdi和%ebp仍从填充p的循环中进行了初始化)

        movq    %rdi, %rdx
        cmpq    %rcx, %rdi
        je      .L10
.L16:
        addl    $1, (%rdx)
        addq    $4, %rdx
        cmpq    %rcx, %rdx
        jne     .L16
.L10:
        movl    sentry(%rip), %eax
        movq    %rdi, %rdx
        addl    $1, %eax
        movl    %eax, sentry(%rip)
        testl   %ebp, %ebp
        jle     .L8
.L14:
        addl    $1, (%rdx)
        addq    $4, %rdx
        cmpq    %rdx, %rsi
        jne     .L14
.L8:

您可以看到.L16和.L14的循环之间没有任何区别。

当然，不同的优化设置会产生不同的结果。使用-O3，可以使用SIMD指令和Duff的设备对循环进行矢量化处理，但它们几乎是相同的。 clang在-O2上进行了此优化

这些都不能否认这一点，即编译器可能需要更努力地工作，以证明通过其写入的指针不能修改任意内存。

但是在这种情况下，就像在许多情况下一样，循环索引是一个局部变量，并且循环非常简单，编译器可以对其进行完全分析，从而可以降低强度，展开和矢量化。那么控制变量是指针还是索引都是无关紧要的。

一个更有趣的示例(可能)是两个数组的循环，其中基本元素的大小不同。给定以下两个功能:

void d2f_ptr(float* out, const double* in, int n) {
  for (auto lim = out + n; out < lim;) *out++ = *in++;
}
void d2f_idx(float out[], const double in[], int n) {
  for (int i = 0; i < n; ++i) out[i] = in[i];
}

gcc(v5.3.0，-O2)确实会产生不同的循环，而基于索引的循环则短了一条指令:

d2f_ptr(float*, double const*, int):    d2f_idx(float*, double const*, int):
        movslq  %edx, %rdx                      xorl    %eax, %eax
        leaq    (%rdi,%rdx,4), %rax             testl   %edx, %edx
        cmpq    %rax, %rdi                      jle     .L16
        jnb     .L11
.L15:                                   .L20:
        addq    $4, %rdi                        pxor    %xmm0, %xmm0
        addq    $8, %rsi                        cvtsd2ss (%rsi,%rax,8), %xmm0
        pxor    %xmm0, %xmm0                    movss   %xmm0, (%rdi,%rax,4)
        cvtsd2ss -8(%rsi), %xmm0                addq    $1, %rax
        movss   %xmm0, -4(%rdi)
        cmpq    %rdi, %rax                      cmpl    %eax, %edx
        ja      .L15                            jg      .L20
.L11:                                   .L16:
        ret                                     ret

但是将double和float更改为大小不再允许使用Intel芯片的索引寻址模式的对象，然后编译器再次将基于索引的代码转换为基于指针的变体。

此处的代码与以前基本相同，但 double 已填充为48个字节:

struct Big { double val; char padding[40]; };
struct Small {
  float val;
  Small& operator=(const Big& other) {
    val = other.val;
    return *this;
  }
};

d2f_ptr(Small*, Big const*, int):       d2f_idx(Small*, Big const*, int):
        movslq  %edx, %rdx                      testl   %edx, %edx
        leaq    (%rdi,%rdx,4), %rax             jle     .L26
        cmpq    %rax, %rdi                      leal    -1(%rdx), %eax
        jnb     .L21                            leaq    4(%rdi,%rax,4), %rax
.L25:                                   .L29:
        addq    $48, %rsi                       pxor    %xmm0, %xmm0
        addq    $4, %rdi                        addq    $4, %rdi
        pxor    %xmm0, %xmm0                    cvtsd2ss (%rsi), %xmm0
        cvtsd2ss -48(%rsi), %xmm0               addq    $48, %rsi
        movss   %xmm0, -4(%rdi)                 movss   %xmm0, -4(%rdi)
        cmpq    %rdi, %rax                      cmpq    %rax, %rdi
        ja      .L25                            jne     .L29
.L21:                                   .L26:
        ret                                     ret

值得一提的是，对于编译器而言，分析特定指针写入将修改哪个对象并不一定会更加困难。 [编辑过的:这里有Alexandrescu的一句话，但它不如我想的那么重要，因此我删除了它，使本节的内容大体上是稻草人。]

实际上，如果仅将一个指针直接分配给一次，并且所有其他修改都是通过递增和递减操作(包括+=和-=)，则编译器完全有权假设该指针始终指向同一对象。如果对指针进行了一些附加的修改以使它超出其他对象，则将是“未定义行为”，编译器可以放弃这种可能性。跟踪流程图中的assign和inc/dec操作非常容易，因此在指针可能已被索引表达式替换的情况下，编译器很可能会发现该错误，从而知道其他对象不是通过指针进行随机变异。