Problem description

I'm trying to optimize my code with SIMD (on ARM CPUs), and want to know its arithmetic intensity (FLOPs/byte, AI) and FLOPS.
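
For a concrete sense of how AI is counted (my sketch, not part of the original question), take a SAXPY-style loop:

    // Hypothetical example: arithmetic intensity of y[i] += a * x[i].
    // Per element: 1 mul + 1 add = 2 FLOPs (or 1 FMA, still counted as 2).
    // Traffic per element: load x[i] (4 B) + load y[i] (4 B) + store y[i] (4 B) = 12 B.
    // AI = 2 FLOPs / 12 bytes = 1/6 FLOP per byte, i.e. strongly memory-bound.
    void saxpy(float a, const float *x, float *y, int n) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }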

In order to calculate AI and FLOPS, I have to count the number of floating point operations (FLOPs). However, I can't find any precise definition of FLOPs.
Of course, mul, add, sub, and div are clearly FLOPs, but what about move operations, shuffle operations (e.g. _mm_shuffle_ps), set operations (e.g. _mm_set1_ps), conversion operations (e.g. _mm_cvtps_pi32), etc.?
They're operations that deal with floating point values. Should I count them as FLOPs? If not, why not?
Which operations do profilers like Intel VTune and Nvidia's nvprof, or PMUs, usually count?


Which operations does FLOPS include?
This question is mainly about mathematically complex operations.
I also want to know the standard way to deal with "non-mathematical" operations that take floating point values or vectors as inputs.

Answer

Shuffles/blends on FP values are not counted as FLOPs. They are just overhead of using SIMD on problems that aren't purely "vertical", or for problems with branching that you do branchlessly with a blend.
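
As an illustration (my sketch, using SSE4.1 intrinsics rather than ARM NEON), a blend turns a per-element branch into straight-line SIMD code; only the mul and add in it would normally be counted as FLOPs:

    #include <smmintrin.h>  // SSE4.1 for _mm_blendv_ps

    // Branchless select: out[i] = x[i] > 0 ? x[i]*2.0f : x[i]+1.0f
    // The mul and add are clearly FLOPs; the blend is exactly the kind of
    // overhead described above (whether the compare counts is itself a gray area).
    __m128 select_example(__m128 x) {
        __m128 mask = _mm_cmpgt_ps(x, _mm_setzero_ps()); // all-ones lanes where x > 0
        __m128 a = _mm_mul_ps(x, _mm_set1_ps(2.0f));     // "taken" path
        __m128 b = _mm_add_ps(x, _mm_set1_ps(1.0f));     // "not taken" path
        return _mm_blendv_ps(b, a, mask);                // per-lane pick
    }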

Neither are FP AND/OR/XOR. You could try to justify counting FP absolute value done with andps (_mm_and_ps), but normally it's not counted. FP abs doesn't require looking at the exponent/significand, or normalizing the result, or any of the things that make FP execution units expensive. abs (AND), sign-flip (XOR), and make-negative (OR) are trivial bitwise ops.
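
The usual bit trick looks like this (a minimal sketch):

    #include <emmintrin.h>

    // Absolute value of 4 floats with a single andps: clear the sign bit.
    // The mask 0x7FFFFFFF keeps exponent and mantissa, drops the sign.
    __m128 abs_ps(__m128 x) {
        const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
        return _mm_and_ps(x, mask);  // normally not counted as a FLOP
    }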

FMA is normally counted as two floating point ops (the mul and the add), even though it's a single instruction with the same (or similar) performance to a SIMD FP add or mul. The most important problem that bottlenecks on raw FLOP/s is matmul, which needs an equal mix of muls and adds and can take advantage of FMA perfectly.
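
A sketch of how that counting works in practice (assuming x86 AVX2+FMA3 for illustration, and n a multiple of 8):

    #include <immintrin.h>

    // Dot product with FMA. Each _mm256_fmadd_ps does 8 muls + 8 adds,
    // i.e. 16 FLOPs per instruction under the usual convention, so the
    // whole loop performs 2*n FLOPs. The loads, the setzero, and the
    // horizontal-sum shuffles are not counted.
    float dot(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8)
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i), acc);
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                              _mm256_extractf128_ps(acc, 1));
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));
        s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
        return _mm_cvtss_f32(s);
    }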

So the FLOP/s of a Haswell core is

  • its SIMD vector width (8 float elements per vector)
  • times SIMD FMAs per clock (2)
  • times FLOPs per FMA (2)
  • times clock speed (the max single-core turbo it can sustain while maxing out both FMA units; long-term this depends on cooling, short-term just on power limits).
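
Plugged into numbers (the 3.5 GHz sustained clock below is an assumed example figure, not a spec):

    // Peak single-precision FLOP/s of one Haswell core.
    double haswell_core_peak_flops(void) {
        const double vector_width  = 8;      // floats per 256-bit vector
        const double fma_per_clock = 2;      // two FMA units
        const double flops_per_fma = 2;      // mul + add
        const double clock_hz      = 3.5e9;  // assumed sustained turbo
        return vector_width * fma_per_clock * flops_per_fma * clock_hz;
        // = 112e9, i.e. 112 GFLOP/s per core
    }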

(For a whole CPU, not just a single core: multiply by the number of cores and use the max sustained clock speed with all cores busy, which is usually lower than single-core turbo on CPUs that have turbo at all.)

Intel and other CPU vendors don't count the fact that their CPUs can also sustain a vandps in parallel with two vfmadd132ps instructions per clock, because FP abs is not a difficult operation.

See also How do I achieve the theoretical maximum of 4 FLOPs per cycle?. (It's actually more than 4 on modern CPUs :P)

Peak FLOPS (FP ops per second, or FLOP/s) isn't achievable if you have much other overhead taking up front-end bandwidth or creating other bottlenecks. The metric is just the raw amount of math you can do when running in a straight line, not on any specific practical problem.
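
One way to see how far a real kernel lands from peak (a rough POSIX-timing sketch; dot() is the hand-counted 2*n-FLOP kernel sketched above):

    #include <time.h>

    float dot(const float *a, const float *b, int n);  // the sketch above

    // Achieved GFLOP/s = hand-counted FLOPs / measured seconds.
    double measure_gflops(const float *a, const float *b, int n) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        volatile float sink = dot(a, b, n);  // volatile: keep the work
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sink;
        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        return 2.0 * n / sec * 1e-9;
    }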

People would think it silly, though, if theoretical peak FLOPS were much higher than a carefully hand-tuned matmul or Mandelbrot could ever achieve, even for compile-time-constant problem sizes: e.g. if the front-end couldn't keep up with doing any stores as well as the FMAs, or if Haswell had four FMA execution units, so it could only sustain max FLOPS if literally every instruction was an FMA. Memory-source operands can micro-fuse for loads, but there'd be no room to store without hurting throughput.

The reason Intel doesn't have even 3 FMA units is that most real code has trouble saturating 2 FMA units, especially with only 2 load ports and 1 store port. They'd be wasted almost all of the time, and a 256-bit FMA unit takes a lot of transistors.
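
A STREAM-triad-style loop illustrates the port pressure (my example, not from the original answer):

    // a[i] = b[i] + s * c[i]: per iteration, 2 loads + 1 store + 1 FMA.
    // With 2 load ports and 1 store port, the memory ops cap this loop at
    // 1 iteration/clock, so it can feed only 1 of Haswell's 2 FMA units.
    void triad(float *a, const float *b, const float *c, float s, int n) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }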

(Ice Lake widens the issue/rename stage of the pipeline to 5 uops/clock, but also widens the SIMD execution units to 512-bit with AVX-512 instead of adding a third 256-bit FMA unit. It has 2/clock load and 2/clock store, although that store throughput is only sustainable to L1d cache for 32-byte or narrower stores, not 64-byte.)
