x86 - Complex Mul and Div using SSE instructions

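Is there any benefit to performing complex multiplication and division with SSE instructions?
I know that addition and subtraction perform better with SSE. Can someone tell me how to do complex multiplication with SSE to get better performance?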

Best answer

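For completeness: the Intel® 64 and IA-32 Architectures Optimization Reference Manual, available for download from Intel, contains assembly for complex multiplication (Example 6-9) and complex division (Example 6-10).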

For example, here is the multiplication code:

// Multiplication of (ak + i bk) * (ck + i dk)
// a + i b can be stored as a data structure
movsldup xmm0, src1         ; load real parts into the destination: a1, a1, a0, a0
movaps   xmm1, src2         ; load the 2nd pair of complex values, i.e. d1, c1, d0, c0
mulps    xmm0, xmm1         ; temporary results: a1d1, a1c1, a0d0, a0c0
shufps   xmm1, xmm1, 0xb1   ; reorder the real and imaginary parts (imm8 = 0xB1): c1, d1, c0, d0
movshdup xmm2, src1         ; load imaginary parts into the destination: b1, b1, b0, b0
mulps    xmm2, xmm1         ; temporary results: b1c1, b1d1, b0c0, b0d0
addsubps xmm0, xmm2         ; b1c1+a1d1, a1c1-b1d1, b0c0+a0d0, a0c0-b0d0

The assembly maps directly onto gcc's x86 intrinsics (just prefix each instruction name with __builtin_ia32_).
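As a rough sketch of that mapping, the same instruction sequence can be written in C with the vendor-neutral intrinsic names from <pmmintrin.h> instead of the raw __builtin_ia32_ builtins. The function names cmul_ps and cmul2 are hypothetical, and the target attribute assumes GCC or Clang on x86 (otherwise drop it and compile with -msse3). The comments list elements lowest first, which is the reverse of the order used in the manual's comments.

```c
#include <pmmintrin.h>  /* SSE3: _mm_moveldup_ps, _mm_movehdup_ps, _mm_addsub_ps */

/* Multiply two pairs of interleaved complex floats (hypothetical helper).
   x = {a0, b0, a1, b1} and y = {c0, d0, c1, d1}, lowest element first,
   where each complex number is a + i b or c + i d. */
__attribute__((target("sse3")))
static __m128 cmul_ps(__m128 x, __m128 y)
{
    __m128 re = _mm_moveldup_ps(x);          /* movsldup: a0, a0, a1, a1 */
    __m128 im = _mm_movehdup_ps(x);          /* movshdup: b0, b0, b1, b1 */
    __m128 t0 = _mm_mul_ps(re, y);           /* mulps:    a0c0, a0d0, a1c1, a1d1 */
    __m128 ys = _mm_shuffle_ps(y, y, 0xB1);  /* shufps:   d0, c0, d1, c1 */
    __m128 t1 = _mm_mul_ps(im, ys);          /* mulps:    b0d0, b0c0, b1d1, b1c1 */
    /* addsubps subtracts in even lanes and adds in odd lanes:
       a0c0-b0d0, a0d0+b0c0, a1c1-b1d1, a1d1+b1c1 */
    return _mm_addsub_ps(t0, t1);
}

/* Hypothetical convenience wrapper for unaligned float arrays:
   multiplies x[0..3] by y[0..3] as two complex pairs, writes out[0..3]. */
__attribute__((target("sse3")))
static void cmul2(const float *x, const float *y, float *out)
{
    _mm_storeu_ps(out, cmul_ps(_mm_loadu_ps(x), _mm_loadu_ps(y)));
}
```

Calling cmul2 with x = {1, 2, 3, 4} and y = {5, 6, 7, 8} computes (1+2i)(5+6i) and (3+4i)(7+8i) in a single pass. Whether this beats scalar code in practice depends on the data layout and the surrounding loop, so it is worth measuring.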