performance - Why is `_mm_stream_si128` much slower than `_mm_storeu_si128` on Skylake-Xeon when writing parts of 2 cache lines, while the effect on Haswell is smaller?

I have code that looks like this (a simple load, modify, store loop; I've simplified it to make it more readable):

__asm__ __volatile__ ( "vzeroupper" : : : );
while(...) {
  __m128i in = _mm_loadu_si128(inptr);
  __m128i out = in; // real code does more than this, but I've simplified it
  _mm_stream_si128(outptr,out);
  inptr  += 12;
  outptr += 16;
}

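This code runs about 5 times faster on our older Haswell hardware than on our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Haswell and 70 seconds on Skylake.

We upgraded to the latest microcode on the Skylake machine,
and also added vzeroupper instructions to avoid any AVX transition issues. Neither fix had any effect.

outptr is aligned to 16 bytes, so the stream instruction should be writing to aligned addresses. (I added checks to verify this.) inptr is unaligned by design. Commenting out the loads makes no difference; the limiting instructions are the stores. outptr and inptr point to different memory regions, and there is no overlap.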
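The alignment check mentioned above is not shown in the question. A minimal sketch of what such a check could look like follows; the helper name `is_aligned` is hypothetical and not taken from the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original post): returns true if p is
// aligned to `alignment` bytes. `alignment` must be a nonzero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

A check such as `is_aligned(outptr, 16)` only confirms the 16-byte requirement of `_mm_stream_si128`; it says nothing about 64-byte cache-line alignment, which is a separate check (`is_aligned(outptr, 64)`).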

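If I replace _mm_stream_si128 with _mm_storeu_si128, the code runs much faster on both machines, in about 2.9 seconds.

So the two questions are:

1) Why is there such a big difference between Haswell and Skylake when writing with the _mm_stream_si128 intrinsic?

2) Why does _mm_storeu_si128 run 5x faster than its streaming equivalent?

I'm a newbie when it comes to intrinsics.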

Addendum - test case

Here is the entire test case: https://godbolt.org/z/toM2lB

Here is a summary of the benchmarks I ran on two different processors, an E5-2680 v3 (Haswell) and an 8180 (Skylake).

// icpc -std=c++14  -msse4.2 -O3 -DNDEBUG ../mre.cpp  -o mre
// The following benchmark times were observed on a Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
// and Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
// The command line was
//    perf stat ./mre 100000
//
//   STORER               time (seconds)
//                     E5-2680   8180
// ---------------------------------------------------
//   _mm_stream_si128     1.65   7.29
//   _mm_storeu_si128     0.41   0.40

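The ratio of stream to store time is 4x on the E5-2680 and 18x on the 8180.

I'm relying on the default new allocator to align my data to 16 bytes, and I'm getting lucky here that it is aligned. I have tested that this is true, and in my production application I use an aligned allocator to make absolutely sure that it is, along with checks on the address. I left that out of the example because I didn't think it mattered.

Second edit - 64B-aligned output

The comment from @Mystical made me check whether the outputs were all cache-line aligned. The writes to the Tile structures are done in 64-byte chunks, but the Tiles themselves were not 64-byte aligned (only 16-byte aligned).

So I changed my test code like this: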

#if 0
    std::vector<Tile> tiles(outputPixels/32);
#else
    std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif

Now the numbers are quite different:

//   STORER               time (seconds)
//                     E5-2680   8180
// ---------------------------------------------------
//   _mm_stream_si128     0.19   0.48
//   _mm_storeu_si128     0.25   0.52

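So everything is much faster. But Skylake is still slower than Haswell by a factor of about 2.

Third edit - purposely misaligned

I tried the test suggested by @HaidBrais. I purposely allocated my vector classes aligned to 64 bytes, then added 16 or 32 bytes inside the allocator so that the allocation was either 16-byte or 32-byte aligned, but NOT 64-byte aligned. I also increased the number of loops to 1,000,000, ran each test 3 times, and picked the smallest time.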

perf stat ./mre1  1000000

To reiterate, an alignment of 2^N here means the address is NOT aligned to 2^(N+1) or 2^(N+2).
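To make that convention concrete: the "exact" alignment of an address is the largest power of two that divides it, which is its lowest set bit. A small sketch, with `exact_alignment` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns the largest power of two dividing addr,
// i.e. the lowest set bit of the address. addr must be nonzero.
inline std::size_t exact_alignment(std::uintptr_t addr) {
    assert(addr != 0);
    return static_cast<std::size_t>(addr & (~addr + 1));
}
```

For a 64-byte-aligned base address, adding 16 gives an exact alignment of 16 and adding 32 gives 32, which matches the 16 and 32 rows of the table below. Note that the 64 row only requires at least 64-byte alignment, so the exact value there may be larger.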

//   STORER               alignment time (seconds)
//                        byte  E5-2680   8180
// ---------------------------------------------------
//   _mm_storeu_si128     16       3.15   2.69
//   _mm_storeu_si128     32       3.16   2.60
//   _mm_storeu_si128     64       1.72   1.71
//   _mm_stream_si128     16      14.31  72.14
//   _mm_stream_si128     32      14.44  72.09
//   _mm_stream_si128     64       1.43   3.38

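So it is clear that cache-line alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.

For future reference, here is the misaligned allocator I used (I did not templatize the misalignment; you'll have to edit the 32 and change it to 0 or 16 as needed):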

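#include <cstddef>  // std::size_t
#include <cstdint>  // uint8_t
#include <cstdlib>  // aligned_alloc, std::free
#include <new>      // std::bad_alloc

template <class T>
struct Mallocator {
  typedef T value_type;
  Mallocator() = default;
  template <class U> constexpr Mallocator(const Mallocator<U>&) noexcept {}
  T* allocate(std::size_t n) {
    if (n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
    // Over-allocate by one element so the 32-byte shift below stays in bounds
    // (this assumes sizeof(T) >= 32).
    uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n+1)*sizeof(T)));
    if (!p1) throw std::bad_alloc();
    p1 += 32; // misalign on purpose
    return reinterpret_cast<T*>(p1);
  }
  void deallocate(T* p, std::size_t) noexcept {
    uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
    p1 -= 32; // undo the shift so we free the pointer aligned_alloc returned
    std::free(p1);
  }
};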
template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }

...

std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);

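Best answer

The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code would exhibit the slowness you've described.

The actual loop from your Godbolt code is: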

while (count > 0)
        {
            // std::cout << std::hex << (void*) ptr << " " << (void*) tile <<std::endl;
            __m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
            __m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
            __m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
            __m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));

            __m128i tileVal0 = value0;
            __m128i tileVal1 = value1;
            __m128i tileVal2 = value2;
            __m128i tileVal3 = value3;

            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);

            ptr    += diffBytes * 4;
            count  -= diffBytes * 4;
            tile   += diffPixels * 4;
            ipixel += diffPixels * 4;
            if (ipixel == 32)
            {
                // go to next tile
                ipixel = 0;
                tileIter++;
                tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
            }
        }

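Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8 and each iteration does 4 stores, this happens on every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, each tile's write covers only part of two different cache lines. That's a known anti-pattern for streaming stores: to use streaming stores effectively, you need to write out full lines.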
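The straddling argument can be checked with a little address arithmetic. The following is an illustrative sketch, assuming 64-byte cache lines; `cache_lines_touched` and `kCacheLine` are hypothetical names, not part of the benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on the Haswell and Skylake parts discussed.
constexpr std::size_t kCacheLine = 64;

// Hypothetical helper: number of distinct cache lines covered by a write of
// `bytes` bytes starting at address `addr`. `bytes` must be nonzero.
inline std::size_t cache_lines_touched(std::uintptr_t addr, std::size_t bytes) {
    assert(bytes > 0);
    const std::uintptr_t first = addr / kCacheLine;
    const std::uintptr_t last = (addr + bytes - 1) / kCacheLine;
    return static_cast<std::size_t>(last - first + 1);
}
```

With a tile starting at offset 16 or 32 within a line, `cache_lines_touched(offset, 64)` is 2, so each tile leaves two partially written lines. At offset 0 it is 1, which is the full-line case that streaming stores are designed for.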

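On to the performance differences: streaming stores perform very differently on different hardware. These stores always occupy a line fill buffer (LFB) for some time, but how long varies. On many client chips a store seems to occupy a buffer only for about the L3 latency. That is, once the streaming store reaches the L3 it can be handed off (the L3 tracks the rest of the work) and the LFB on the core can be freed. Server chips often have much longer latency, especially multi-socket hosts.

Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial-line writes. The overall worse performance is probably related to the redesign of the L3 cache.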
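As a rough sketch of the full-line pattern described above, the following writes exactly one 64-byte-aligned cache line per tile with four streaming stores. `Tile64`, `stream_full_line`, and the `src_stride` parameter are hypothetical stand-ins for the benchmark's `Tile`, loop body, and `diffBytes`; this is not the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_stream_si128; also provides _mm_sfence

// Hypothetical stand-in for the question's Tile: exactly one cache line,
// forced to 64-byte alignment so the four stores never straddle two lines.
struct alignas(64) Tile64 {
    std::uint16_t pixels[32];
};
static_assert(sizeof(Tile64) == 64, "Tile64 must cover exactly one cache line");

// Fills one tile with four 16-byte non-temporal stores.
// src may be unaligned; src_stride is the byte distance between the four loads,
// and 16 bytes must be readable at each of src + i * src_stride for i in 0..3.
inline void stream_full_line(Tile64* dst, const std::uint8_t* src,
                             std::size_t src_stride) {
    assert((reinterpret_cast<std::uintptr_t>(dst) & 63) == 0);
    __m128i* out = reinterpret_cast<__m128i*>(dst->pixels);
    for (int i = 0; i < 4; ++i) {
        const __m128i v = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(src + i * src_stride));
        _mm_stream_si128(out + i, v);  // the 4 stores together cover the whole line
    }
}
```

After a batch of tiles has been written, call `_mm_sfence()` before other threads or later code rely on the streamed data. For a `std::vector` of such tiles, an allocator that honors the 64-byte alignment is still needed, as in the second edit of the question.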
