


For the following function, the optimized code is vectorized and the computation is performed in registers (the result is returned in eax). The generated machine code can be seen, e.g., here: https://godbolt.org/z/VQEBV4.

int sum(int *arr, int n) {
  int ret = 0;
  for (int i = 0; i < n; i++)
    ret += arr[i];
  return ret;
}

However, if I make ret a global variable (or a parameter of type int&), vectorization is not used and the compiler stores the updated ret to memory in each iteration. Machine code: https://godbolt.org/z/NAmX4t.

int ret = 0;

int sum(int *arr, int n) {
  for (int i = 0; i < n; i++)
    ret += arr[i];
  return ret;
}


I don't understand why the optimizations (vectorization, computation in registers) are prevented in the latter case. There is no threading involved, and the increments are not even performed atomically. Moreover, this behavior seems to be consistent across compilers (GCC, Clang, Intel), so I believe there must be some reason for it.


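If ret is global rather than local, arr might alias ret, which reduces the opportunity to optimize. The compiler has to assume that a store to ret could change a later arr[i], so it cannot simply keep the running sum in a register.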
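To make the aliasing concern concrete, here is a minimal sketch in C. It is not the code from the question: the two function names and the extra pointer parameter are hypothetical, standing in for the global (or int&) accumulator. The first function updates the accumulator through the pointer on every iteration, the second keeps the running sum in a local and stores it once at the end.

```c
/* Hypothetical variant of the question's loop: the accumulator is reached
   through a pointer, so it may overlap with an element of arr. */
void sum_through_pointer(const int *arr, int n, int *ret) {
  for (int i = 0; i < n; i++)
    *ret += arr[i];  /* each store may change an arr[i] that is read later */
}

/* Same loop, but the running sum lives in a local variable and is written
   back once. This is only equivalent when ret does not point into arr. */
void sum_local_accumulator(const int *arr, int n, int *ret) {
  int acc = *ret;
  for (int i = 0; i < n; i++)
    acc += arr[i];
  *ret = acc;
}
```

With an array {1, 1} and ret pointing at its second element, the first function leaves 4 there and the second leaves 3, so the compiler is not allowed to turn one into the other on its own. When the caller knows the accumulator never overlaps the array, writing the loop in the second form (or copying the global into a local before the loop) should give the optimizer the same freedom it has with a local ret.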



09-06 10:13