SSE-optimized code runs about as fast as the plain version

Problem description

I wanted to take my first steps with Intel's SSE, so I followed the guide published here, except that instead of developing for Windows and C++ I am developing for Linux and C (so I use posix_memalign rather than _aligned_malloc).

I also implemented a compute-intensive method without using the SSE extensions. Surprisingly, when I run the program, both versions (the one with SSE and the one without) take similar amounts of time, and the SSE version is usually slightly slower.

Is that normal? Could GCC already be optimizing with SSE, even with the -O0 option? I also tried the -mfpmath=387 option, but it made no difference.

Solution

For floating-point operations you may not see a huge benefit with SSE. Most modern x86 CPUs have two FPUs, so double-precision SIMD may run at only about the same speed as scalar code, and single precision might give you 2x for SIMD over scalar on a good day. For integer operations, though, e.g. image or audio processing at 8 or 16 bits, you can still get substantial benefits with SSE.
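To make the integer case concrete, here is a minimal sketch in C of an 8-bit saturating add, written once as a plain loop and once with SSE2 intrinsics that process 16 bytes per instruction. The function names add_sat_u8_scalar and add_sat_u8_sse2 are made up for this illustration and are not taken from the guide mentioned in the question. The sketch uses unaligned loads and stores, so the buffers do not have to come from posix_memalign.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scalar reference (hypothetical helper): saturating add of two
 * unsigned 8-bit buffers, one element per iteration. */
void add_sat_u8_scalar(const uint8_t *a, const uint8_t *b,
                       uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}

/* SSE2 version (hypothetical helper): 16 elements per iteration.
 * _mm_adds_epu8 performs the unsigned saturating add on all 16 lanes. */
void add_sat_u8_sse2(const uint8_t *a, const uint8_t *b,
                     uint8_t *out, size_t n)
{
    size_t i = 0;

    /* Main loop: full 16-byte blocks, unaligned access allowed. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_adds_epu8(va, vb));
    }

    /* Tail: remaining 0..15 elements handled with scalar code. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

To compare the two, call both on the same large buffers from your own timing loop and check that the outputs match. Build with optimization enabled (for example -O2) when timing, since -O0 turns off most optimizations. How much faster the SSE2 version turns out depends on the CPU, the buffer sizes, and whether the compiler vectorizes the scalar loop by itself at the chosen optimization level.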
