Problem description
I wanted to take my first steps with Intel's SSE, so I followed the guide published here, with the difference that instead of developing for Windows and C++ I am targeting Linux and C (so I use posix_memalign rather than _aligned_malloc).
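For illustration, the posix_memalign substitution described above might look roughly like the sketch below. The helper names alloc_floats_aligned and add_floats_sse are hypothetical (they are not taken from the guide), and the sketch assumes 16-byte alignment and a length that is a multiple of 4, which is what the aligned SSE load and store intrinsics expect.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, ... */

/* Hypothetical helper: allocate n floats on a 16-byte boundary.
   Returns NULL on failure; release the buffer with free(). */
static float *alloc_floats_aligned(size_t n)
{
    void *p = NULL;
    /* posix_memalign returns 0 on success and an error number otherwise. */
    if (posix_memalign(&p, 16, n * sizeof(float)) != 0)
        return NULL;
    return p;
}

/* Hypothetical helper: dst[i] = a[i] + b[i], four floats per iteration.
   Assumes every pointer is 16-byte aligned and n is a multiple of 4. */
static void add_floats_sse(float *dst, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)a % 16 == 0 && (uintptr_t)b % 16 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);             /* aligned load of 4 floats */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(va, vb));  /* aligned store of the 4 sums */
    }
}
```

Unlike _aligned_malloc on Windows, which needs a matching _aligned_free, memory obtained from posix_memalign is released with plain free().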
I also implemented a compute-intensive method without the SSE extensions. Surprisingly, when I run the program, both pieces of code (the one with SSE and the one without) take similar amounts of time, and the SSE version is usually slightly slower than the other.
Is that normal? Could GCC already be optimizing with SSE (even with the -O0 option)? I also tried the -mfpmath=387 option, but it made no difference.
For floating-point operations you may not see a huge benefit from SSE. Most modern x86 CPUs have two FPUs, so for double precision SIMD may only be about the same speed as scalar code, and for single precision SIMD might give you 2x over scalar on a good day. For integer operations, though, such as image or audio processing at 8 or 16 bits, you can still get substantial benefits from SSE.
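As a rough illustration of the integer case, the sketch below adds two 8-bit buffers with saturation, first one element at a time and then 16 elements per instruction with SSE2. The function names are hypothetical and the choice of saturating addition is only an assumed example of an image-style operation. When timing the two versions, it is probably worth compiling with optimization enabled (for example -O2), because at -O0 the intrinsics are not optimized and every intermediate value goes through the stack. At higher optimization levels GCC may also auto-vectorize the scalar loop, which narrows the gap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Scalar reference: saturating add of two 8-bit buffers, one element at a time. */
static void add_sat_u8_scalar(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum);  /* clamp to the 8-bit maximum */
    }
}

/* SSE2 version: 16 elements per iteration. Unaligned loads and stores are
   used so that any buffer works; leftover elements fall back to scalar code. */
static void add_sat_u8_sse2(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* _mm_adds_epu8 adds 16 unsigned bytes with saturation in one step. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
    add_sat_u8_scalar(dst + i, a + i, b + i, n - i);  /* tail of fewer than 16 */
}
```

Each SSE2 iteration replaces 16 scalar additions and 16 clamps, which is why narrow integer types tend to benefit more from SIMD than double-precision floats, where only two values fit in a register.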