Problem description
Apparently the MSVC++ 2017 toolset v141 (x64 Release configuration) does not emit the FYL2X x86_64 instruction through a C/C++ intrinsic. Instead, uses of C++ log() or log2() result in a real call to a long function that seems to implement an approximation of the logarithm (without using FYL2X). The performance I measured is also strange: log() (natural logarithm) is 1.7667 times faster than log2() (base-2 logarithm). I expected the base-2 logarithm to be easier for the processor, because a floating-point number stores its exponent (and its mantissa) in binary, and that seems to be why the CPU instruction FYL2X computes a base-2 logarithm (multiplied by a parameter).
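The intuition about the binary exponent can be made concrete. Since a double is stored as mantissa times a power of two, log2(x) splits into the stored exponent plus log2 of the mantissa, and only the second term needs an approximation. The following is a minimal sketch of that split; Log2ViaExponent is a hypothetical name used for illustration, and nothing here claims to be how MSVC's log2() or FYL2X works internally.

```cpp
#include <cassert>
#include <cmath>

// Illustrative helper (not from the original post): computes log2(x) by
// splitting x into its mantissa and binary exponent.
double Log2ViaExponent(double x) {
    assert(x > 0.0 && std::isfinite(x));  // log2 is only defined here for positive finite x
    int exponent = 0;
    // std::frexp returns m in [0.5, 1) such that x == m * 2^exponent.
    const double mantissa = std::frexp(x, &exponent);
    // The exponent contributes an exact integer; only log2(mantissa),
    // which lies in [-1, 0), has to be approximated.
    return static_cast<double>(exponent) + std::log2(mantissa);
}
```

For example, Log2ViaExponent(8.0) splits 8 into 0.5 * 2^4 and returns 4 + log2(0.5), which is 3.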
Here is the code used for measurements:
#include <chrono>
#include <cmath>
#include <cstdint>  // int64_t
#include <cstdio>

const int64_t cnLogs = 100 * 1000 * 1000;

// Times cnLogs calls of std::log2() and prints the throughput.
void BenchmarkLog2() {
    double sum = 0;
    auto start = std::chrono::high_resolution_clock::now();
    for (int64_t i = 1; i <= cnLogs; i++) {
        sum += std::log2(double(i));
    }
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    // Printing the sum keeps the compiler from optimizing the loop away.
    printf("Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}

// Times cnLogs calls of std::log() and prints the throughput.
void BenchmarkLn() {
    double sum = 0;
    auto start = std::chrono::high_resolution_clock::now();
    for (int64_t i = 1; i <= cnLogs; i++) {
        sum += std::log(double(i));
    }
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    printf("Ln: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}

int main() {
    BenchmarkLog2();
    BenchmarkLn();
    return 0;
}
The output on a Ryzen 1800X is:
Log2: 95152910.728 Ops/sec calculated 2513272986.435
Ln: 168109607.464 Ops/sec calculated 1742068084.525
To shed light on these two observations (FYL2X not being used, and the strange performance difference), I would also like to measure the performance of FYL2X and, if it is faster, use it instead of the <cmath> functions. MSVC++ does not allow inline assembly on x64, so a function in a separate assembly file that uses FYL2X is needed.
Could you answer with the assembly code for such a function, using FYL2X or, if newer x86_64 processors offer one, a better logarithm instruction (one that does not require a specific base)?
Recommended answer
Here is the assembly code using FYL2X:
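_DATA SEGMENT
_DATA ENDS

_TEXT SEGMENT

PUBLIC SRLog2MulD

; XMM0L = toLog
; XMM1L = toMul
; Returns toMul * log2(toLog) in XMM0L.
SRLog2MulD PROC
    ; Spill both arguments to the caller-provided shadow space so the x87 FPU can load them.
    movq qword ptr [rsp+16], xmm1
    movq qword ptr [rsp+8], xmm0
    fld qword ptr [rsp+16]          ; ST(0) = toMul
    fld qword ptr [rsp+8]           ; ST(0) = toLog, ST(1) = toMul
    fyl2x                           ; ST(1) = ST(1) * log2(ST(0)), then pop: result in ST(0)
    fstp qword ptr [rsp+8]          ; store the result and pop the x87 stack
    movq xmm0, qword ptr [rsp+8]    ; the double return value goes in XMM0
    ret
SRLog2MulD ENDP

_TEXT ENDS

END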
The calling convention follows https://docs.microsoft.com/en-us/cpp/build/overview-of-x64-calling-conventions, so the C++ prototype is:
extern "C" double __fastcall SRLog2MulD(const double toLog, const double toMul);
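The toMul parameter is what makes a "log2 times a factor" primitive usable for any base: log_b(x) = log2(x) / log2(b), so passing toMul = 1 / log2(b) yields the base-b logarithm, and toMul = ln 2 yields the natural logarithm. The sketch below illustrates this with a portable stand-in. Log2MulStandIn and LogBase are hypothetical names, and the stand-in only assumes the same contract as SRLog2MulD (toMul * log2(toLog)); it does not call the assembly routine.

```cpp
#include <cassert>
#include <cmath>

// Portable stand-in for SRLog2MulD (assumption: same contract,
// i.e. it returns toMul * log2(toLog)). Swap in the real routine when linking it.
double Log2MulStandIn(double toLog, double toMul) {
    return toMul * std::log2(toLog);
}

// Hypothetical helper: logarithm of x in an arbitrary base.
double LogBase(double x, double base) {
    assert(x > 0.0 && base > 0.0 && base != 1.0);
    // log_b(x) = log2(x) / log2(b), so the multiplier is 1 / log2(b).
    return Log2MulStandIn(x, 1.0 / std::log2(base));
}
```

When the base is fixed, the multiplier 1 / log2(base) can be computed once outside the hot loop, so the per-call cost stays at one log2 and one multiplication.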
The FYL2X version is roughly 2 times slower than std::log2() (about 1.8 times by the numbers below) and more than 3 times slower than std::log():
Log2: 94803174.389 Ops/sec calculated 2513272986.435
FPU Log2: 52008300.525 Ops/sec calculated 2513272986.435
Ln: 169392473.892 Ops/sec calculated 1742068084.525
The benchmark code is as follows:
void BenchmarkFpuLog2() {
    double sum = 0;
    auto start = std::chrono::high_resolution_clock::now();
    for (int64_t i = 1; i <= cnLogs; i++) {
        sum += SRPlat::SRLog2MulD(double(i), 1);
    }
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    printf("FPU Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}
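The three benchmark functions differ only in the function being timed and the label printed. One possible way to factor that out is sketched below; BenchmarkLogFn is a hypothetical helper that is not part of the original post, and the switch to std::chrono::steady_clock and a floating-point duration is a design choice of this sketch, not something the measurements above relied on.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: times `count` calls of `fn` on the inputs 1..count,
// prints the throughput, and returns the accumulated sum.
template <typename Fn>
double BenchmarkLogFn(const char* label, Fn fn, std::int64_t count) {
    assert(count > 0);
    double sum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 1; i <= count; i++) {
        sum += fn(static_cast<double>(i));
    }
    // duration<double> holds seconds as a floating-point value.
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    // Printing the sum keeps the compiler from discarding the loop.
    std::printf("%s: %.3f Ops/sec calculated %.3f\n", label, count / elapsed.count(), sum);
    return sum;
}
```

Under these assumptions, the FPU case would be invoked as BenchmarkLogFn("FPU Log2", [](double x) { return SRPlat::SRLog2MulD(x, 1); }, cnLogs), and the std::log2() and std::log() cases would pass lambdas wrapping those functions.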