Question

I'm currently working on some optimizations and comparing vectorization possibilities for DSP applications that seem ideal for AVX512, since they are just simple uncorrelated array-processing loops. But on a new i9 I didn't measure any reasonable improvement when using AVX512 compared to AVX2. Any pointers? Any good results? (By the way, I tried MSVC/Clang/ICL with no noticeable difference; many times the AVX512 code actually seems slower.)

Recommended answer

This seems too broad, but there are actually some microarchitectural details worth mentioning.

Note that AVX512-VL (Vector Length) lets you use new AVX512 instructions (like packed uint64_t <-> double conversion, mask registers, etc.) on 128-bit and 256-bit vectors. Modern compilers typically auto-vectorize with 256-bit vectors when tuning for Skylake-AVX512 (aka Skylake-X), e.g. with gcc -march=native or gcc -march=skylake-avx512, unless you override the tuning options to set the preferred vector width to 512 (GCC and Clang: -mprefer-vector-width=512) for code where the tradeoffs are worth it. See @zam's answer.

Some major things with 512-bit vectors on Skylake-X (as opposed to 256-bit vectors using AVX512 instructions, like vpxord ymm30, ymm29, ymm10) are:

  • Aligning your data to the vector width is more important than with AVX2: with 64-byte vectors, every unaligned load crosses a 64-byte cache-line boundary, instead of every other one while looping over an array. In practice it makes a bigger difference. I totally forget the exact results of something I tested a while ago, but maybe a 20% slowdown from misalignment with 512-bit vectors vs. under 5% with 256-bit vectors.

  • Running 512-bit uops shuts down the vector ALU on port 1 (but not the integer execution units on port 1). Some Skylake-X CPUs (e.g. Xeon Bronze) only have one-per-clock 512-bit FMA throughput, but i7 / i9 Skylake-X CPUs, and the higher-end Xeons, have an extra 512-bit FMA unit on port 5 that powers up for AVX512 "mode".

So plan accordingly: you won't get double speed from widening to AVX512, and the bottleneck in your code might now be in the back-end.

  • Running 512-bit uops also limits your max Turbo, so wall-clock speedups can be lower than core-clock-cycle speedups. There are two levels of Turbo reduction: any 512-bit operation at all, and then heavy 512-bit, like sustained FMAs.

  • The FP divide execution unit for vsqrtps/pd zmm and vdivps/pd zmm is not full width; it's only 128-bit wide, so the ratio of div/sqrt vs. multiply throughput is worse by about another factor of 2. See "Floating point division vs floating point multiplication". SKX throughput for vsqrtps xmm/ymm/zmm is one per 3/6/12 cycles. Double precision has the same ratios but worse throughput and latency.

Up to 256-bit YMM vectors, the latency is the same as XMM (12 cycles for sqrt), but for 512-bit ZMM the latency goes up to 20 cycles, and it takes 3 uops. (See https://agner.org/optimize/ for instruction tables.)

If you bottleneck on the divider and can't get more other instructions in the mix, VRSQRT14PS is worth considering even if you need a Newton iteration to get enough precision. (But note that AVX512's approximate 1/sqrt(x) does have more guaranteed-accuracy bits than AVX/SSE.)

As far as auto-vectorization goes, if there are any shuffles required, compilers might do a worse job with wider vectors. For simple pure-vertical stuff, compilers can do OK with AVX512.

Your previous question had a sin function, and maybe if the compiler / SIMD math library only has a 256-bit version of that, it won't auto-vectorize with AVX512.

If AVX512 doesn't help, maybe you're bottlenecked on memory bandwidth. Profile with performance counters and find out. Or try more repeats of smaller buffer sizes and see if it speeds up significantly when your data is hot in cache. If so, try to cache-block your code, or increase computational intensity by doing more in one pass over the data.

AVX512 does double theoretical max FMA throughput on an i9 (and integer multiply, and many other things that run on the same execution unit), making the mismatch between DRAM and execution units twice as big. So there's twice as much to gain from making better use of L2 / L1d cache.

Working with data while it's already loaded in registers is good.