Question
I would like to know if any code in C or C++ using floating-point arithmetic would produce bit-exact results on any x86-based architecture, regardless of the complexity of the code.
To my knowledge, any x86 architecture since the Intel 8087 uses an FPU prepared to handle IEEE-754 floating-point numbers, and I cannot see any reason why the result would differ between architectures. However, if the results did differ (say, due to a different compiler or a different optimization level), would there be some way to produce bit-exact results just by configuring the compiler?
Answer
Contents:
- C / C++
- asm
- Making real-world software that achieves this
No: a fully ISO C11-conforming and IEEE-conforming C implementation does not guarantee bit-identical results to other C implementations, even other implementations on the same hardware.
(And first of all, I'm going to assume we're talking about normal C implementations where double is the IEEE-754 binary64 format, etc. It would be legal for a C implementation on x86 to use some other format for double, implement FP math with software emulation, and define the limits in float.h accordingly. That might have been plausible back when not all x86 CPUs included an FPU, but in 2016 it's Deathstation 9000 territory.)
Related: Bruce Dawson's Floating-Point Determinism blog post is an answer to this question. His opening paragraph is amusing (and is followed by a lot of interesting stuff).
If you're pondering this question, then you will definitely want to have a look at the index to Bruce's series of articles about floating-point math, as implemented by C compilers on x86, and also asm, and IEEE FP in general.
First problem: only the "basic operations" (+ - * / and sqrt) are required to return "correctly rounded" results, i.e. <= 0.5 ulp of error: rounded correctly out to the last bit of the mantissa, so the result is the representable value closest to the exact result.
Other math library functions like pow(), log(), and sin() let implementers make a tradeoff between speed and accuracy. For example, glibc generally favours accuracy, and is slower than Apple's OS X math libraries for some functions, IIRC. See also glibc's documentation of the error bounds for every libm function across different architectures.
But wait, it gets worse. Even code that uses only the correctly-rounded basic operations doesn't guarantee the same results.
C's rules also allow some flexibility in keeping higher-precision temporaries. The implementation defines FLT_EVAL_METHOD so code can detect how it works, but you don't get a choice if you don't like what the implementation does. You do get a choice (with #pragma STDC FP_CONTRACT off) to forbid the compiler from, e.g., turning a*b + c into an FMA with no rounding of the a*b temporary before the add.
On x86, compilers targeting 32-bit non-SSE code (i.e. using obsolete x87 instructions) typically keep FP temporaries in x87 registers between operations. This produces the FLT_EVAL_METHOD = 2 behaviour of 80-bit precision. (The standard says that rounding still happens on every assignment, but real compilers like gcc don't actually do extra stores/reloads for rounding unless you use -ffloat-store. See https://gcc.gnu.org/wiki/FloatingPointMath. That part of the standard seems to have been written assuming non-optimizing compilers, or hardware that efficiently provides rounding to the type width, like non-x86, or like x87 with precision set to round to 64-bit double instead of 80-bit long double. Storing after every statement is exactly what gcc -O0 and most other compilers do, and the standard allows extra precision within the evaluation of one expression.)
So when targeting x87, the compiler is allowed to evaluate the sum of three floats with two x87 FADD instructions, without rounding off the sum of the first two to a 32-bit float. In that case, the temporary has 80-bit precision... or does it? Not always, because the C implementation's startup code (or a Direct3D library!) may have changed the precision setting in the x87 control word, so values in x87 registers are rounded to a 53- or 24-bit mantissa. (This makes FDIV and FSQRT run a bit faster.) All of this is from Bruce Dawson's article about intermediate FP precision.
With the rounding mode and precision set the same, I think every x86 CPU should give bit-identical results for the same inputs, even for complex x87 instructions like FSIN.
Intel's manuals don't define exactly what those results are for every case, but I think Intel aims for bit-exact backwards compatibility. I doubt they'll ever add extended-precision range reduction for FSIN, for example. It uses the 80-bit pi constant you get with fldpi (a correctly-rounded 64-bit mantissa, effectively 66 bits because the next two bits of the exact value are zero). Intel's documentation of the worst-case error was off by a factor of about 1.3 quintillion until they updated it after Bruce Dawson noticed how bad the worst case actually was. But that can only be fixed with extended-precision range reduction, so it wouldn't be cheap in hardware.
I don't know whether AMD implements FSIN and the other micro-coded instructions to always give results bit-identical to Intel's, but I wouldn't be surprised. Some software does rely on it, I think.
Since SSE only provides instructions for add/sub/mul/div/sqrt, there's nothing too interesting to say. They implement the IEEE operations exactly, so there's no chance that any x86 implementation will ever give you anything different (unless the rounding mode is set differently, or the denormals-are-zero and/or flush-to-zero settings differ and you have any denormals).
SSE rsqrt (the fast approximate reciprocal square root) is not exactly specified, and I think it's possible you might get a different result even after a Newton iteration. Other than that, SSE/SSE2 is always bit-exact in asm, assuming MXCSR isn't set weirdly. So the only question is getting the compiler to generate the same code, or just using the same binaries.
So, if you statically link a libm that uses SSE/SSE2 and distribute those binaries, they will run the same everywhere. Unless that library uses run-time CPU detection to choose alternate implementations...
As @Yan Zhou points out, you pretty much need to control every bit of the implementation, down to the asm, to get bit-exact results.
However, some games really do depend on this for multi-player, often with detection/correction for clients that get out of sync. Instead of sending the entire game state over the network every frame, every client computes what happens next. If the game engine is carefully implemented to be deterministic, they stay in sync.
In Spring RTS, clients checksum their game state to detect desyncs. I haven't played it for a while, but I do remember reading something at least 5 years ago about them trying to achieve sync by making sure all their x86 builds used SSE math, even the 32-bit builds.
One possible reason some games don't allow multi-player between PC and non-x86 console systems is that the engine gives the same results on all PCs, but different results on the different-architecture console with its different compiler.
Further reading: GAFFER ON GAMES: Floating Point Determinism. Some techniques that real game engines use to get deterministic results, e.g. wrapping sin/cos/tan in non-optimized function calls to force the compiler to leave them at single precision.