This question is addressed to users experienced with the SSE/AVX instruction families, especially those familiar with analyzing their performance. I have seen many different implementations and approaches, ranging from older SSE2 code to newer ones, and the web is flooded with links to them. Personally, though, I am not deeply experienced in analyzing SSE assembly. Some people point to uops and caches, and that requires low-level knowledge. So I am asking for hints and your personal experience. If you have time to run a comparison, I would like to know which approach is fastest, why, and which approaches you looked at. The implementation need not be very precise: 10-16 bits of single-precision accuracy is good enough. More is better, as long as it does not affect speed.

PS. To avoid a flood of meta comments, here is the task in detail:

Given a scalar argument x (in radians) that is passed in an xmm register (according to the x64 fastcall convention).

Write a function with the signature __m128 sincos(float x); that returns approximations of sin(x) and cos(x).

The return value should fit in one xmm register and be computed as fast as possible while satisfying the 10-bit precision requirement.

The argument can be any real number (but not NaN, inf, and so on). If the approach requires argument normalisation, a performant implementation of that step (fmod()) is also part of the question. The question is not about handling special FP cases.

This may be a duplicate, but I have failed to find a similar question here, so please point me to one if it already exists.
Solution

I have discovered a great modern revision of Julien Pommier's implementations, ported to AVX/AVX2 under the zlib license, thanks to Giovanni Garberoglio:

http://software-lisc.fbk.eu/avx_mathfun/

It is really fast: 80-90M iterations per second on a single core of an i7 3770k, with each iteration producing 8 sines and 8 cosines. For comparison, I get about 15M iterations per second if I call sinf() 8 times and cosf() 8 times per iteration (functions from the msvc2017 x64 library, with AVX compiler settings).

UPD: There are also the excellent FastTrigo code samples, where the FT::sincos() function is 20% faster than Julien Pommier's implementation. FT::sincos() provides exactly 10 bits of guaranteed accuracy.
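To make the shape of such a function concrete, here is a minimal sketch of the requested __m128 sincos(float x) signature using only SSE2 intrinsics. It is not the code from avx_mathfun or FastTrigo. The names sincos_sketch and max_abs_error_sketch are hypothetical, and the method (computing cos(x) as sin(x + pi/2), reducing by multiples of pi with a single float constant, then a degree-7 Taylor polynomial) is my own assumption, chosen for brevity rather than speed. It assumes the default round-to-nearest mode and moderate |x|; accuracy degrades as |x| grows because the reduction is done in single precision.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cmath>

// Hypothetical sketch: lane 0 = sin(x), lane 1 = cos(x), lanes 2 and 3 unused.
static inline __m128 sincos_sketch(float x) {
    const __m128 inv_pi = _mm_set1_ps(0.318309886f);
    const __m128 pi     = _mm_set1_ps(3.14159265f);

    // cos(x) = sin(x + pi/2), so one sine evaluation serves both lanes.
    __m128 v = _mm_add_ps(_mm_set1_ps(x),
                          _mm_setr_ps(0.0f, 1.57079633f, 0.0f, 0.0f));

    // q = nearest integer to v/pi (uses the current rounding mode).
    __m128i q = _mm_cvtps_epi32(_mm_mul_ps(v, inv_pi));

    // r = v - q*pi lies in roughly [-pi/2, pi/2].
    __m128 r = _mm_sub_ps(v, _mm_mul_ps(_mm_cvtepi32_ps(q), pi));

    // sin(v) = (-1)^q * sin(r): move the parity bit of q into the sign bit.
    __m128 sign = _mm_castsi128_ps(_mm_slli_epi32(q, 31));

    // Taylor polynomial r - r^3/6 + r^5/120 - r^7/5040 in Horner form.
    __m128 r2 = _mm_mul_ps(r, r);
    __m128 p  = _mm_set1_ps(-1.0f / 5040.0f);
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f / 120.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(-1.0f / 6.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f));
    p = _mm_mul_ps(p, r);

    return _mm_xor_ps(p, sign);
}

// Hypothetical helper: largest absolute error of either lane against the
// standard library over [lo, hi], sampled every `step`.
static inline float max_abs_error_sketch(float lo, float hi, float step) {
    float worst = 0.0f;
    for (float x = lo; x <= hi; x += step) {
        float out[4];
        _mm_storeu_ps(out, sincos_sketch(x));
        float es = std::fabs(out[0] - static_cast<float>(std::sin(static_cast<double>(x))));
        float ec = std::fabs(out[1] - static_cast<float>(std::cos(static_cast<double>(x))));
        if (es > worst) worst = es;
        if (ec > worst) worst = ec;
    }
    return worst;
}
```

Calling max_abs_error_sketch(-100.0f, 100.0f, 0.01f) is one way to check the result against the 10-bit target (an absolute error below 1/1024) before timing anything. A tuned minimax polynomial and a more careful range reduction, as used in the libraries above, should do better than this Taylor-based version on both speed and accuracy.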
08-29 14:11