Problem description
I have a packed vector of four 64-bit floating-point values.
I would like to get the sum of the vector's elements.
With SSE (and using 32-bit floats) I could just do the following:
v_sum = _mm_hadd_ps(v_sum, v_sum);
v_sum = _mm_hadd_ps(v_sum, v_sum);
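For reference, here is a minimal, self-contained sketch of that SSE approach. The function name hsum4_ps_sse3 and the pointer-based interface are hypothetical, chosen only for illustration, and the target attribute assumes GCC or Clang (with other compilers you would enable SSE3 through a compiler flag instead).

```c
#include <immintrin.h>

/* Hypothetical helper: sums the four floats at p using SSE3 HADDPS.
   The target attribute is a GCC/Clang extension (an assumption here). */
__attribute__((target("sse3")))
float hsum4_ps_sse3(const float *p)
{
    __m128 v = _mm_loadu_ps(p);  /* v = [a, b, c, d] */
    v = _mm_hadd_ps(v, v);       /* [a+b, c+d, a+b, c+d] */
    v = _mm_hadd_ps(v, v);       /* every element now holds a+b+c+d */
    return _mm_cvtss_f32(v);     /* return the lowest element */
}
```

Calling hsum4_ps_sse3 on an array holding 1, 2, 3, 4 should return 10.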
Unfortunately, even though AVX features a _mm256_hadd_pd instruction, it differs in the result from the SSE version. I believe this is due to the fact that most AVX instructions work as SSE instructions for each low and high 128-bits separately, without ever crossing the 128-bit boundary.
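To make the per-lane behaviour concrete, the following sketch applies _mm256_hadd_pd to a vector paired with itself and stores the result. The helper name hadd_pd_self is hypothetical, the target attribute assumes GCC or Clang, and running it assumes an AVX-capable CPU.

```c
#include <immintrin.h>

/* Hypothetical helper: computes _mm256_hadd_pd(v, v) for v = in[0..3]
   and writes the four results to out[0..3]. */
__attribute__((target("avx")))
void hadd_pd_self(const double *in, double *out)
{
    __m256d v = _mm256_loadu_pd(in);   /* v = [v0, v1, v2, v3] */
    __m256d h = _mm256_hadd_pd(v, v);  /* [v0+v1, v0+v1, v2+v3, v2+v3] */
    _mm256_storeu_pd(out, h);
}
```

For the input 1, 2, 3, 4 the output is 3, 3, 7, 7: each 128-bit half is summed on its own, so applying the instruction a second time would not combine the two halves.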
Ideally, the solution I am looking for should follow these guidelines:
1) only use AVX/AVX2 instructions. (no SSE)
2) do it in no more than 2-3 instructions.
However, any efficient/elegant way to do it (even without following the above guidelines) is always well accepted.
Thank you very much for your help.
-Luigi Castelli
Recommended answer
If you have two __m256d vectors x1 and x2 that each contain four doubles that you want to horizontally sum, you could do:
__m256d x1, x2;
// calculate 4 two-element horizontal sums:
// lower 64 bits contain x1[0] + x1[1]
// next 64 bits contain x2[0] + x2[1]
// next 64 bits contain x1[2] + x1[3]
// next 64 bits contain x2[2] + x2[3]
__m256d sum = _mm256_hadd_pd(x1, x2);
// extract upper 128 bits of result
__m128d sum_high = _mm256_extractf128_pd(sum, 1);
// add upper 128 bits of sum to its lower 128 bits
__m128d result = _mm_add_pd(sum_high, _mm256_castpd256_pd128(sum));
// lower 64 bits of result contain the sum of x1[0], x1[1], x1[2], x1[3]
// upper 64 bits of result contain the sum of x2[0], x2[1], x2[2], x2[3]
So it looks like 3 instructions will do 2 of the horizontal sums that you need. The above is untested, but you should get the concept.
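For the single-vector case in the question, the same idea can be sketched as a self-contained function. The name hsum4_pd_avx and the pointer-based interface are hypothetical, the target attribute assumes GCC or Clang, and an AVX-capable CPU is assumed at run time. This is one possible variant rather than a definitive implementation.

```c
#include <immintrin.h>

/* Hypothetical helper: returns p[0] + p[1] + p[2] + p[3] using AVX. */
__attribute__((target("avx")))
double hsum4_pd_avx(const double *p)
{
    __m256d v  = _mm256_loadu_pd(p);            /* v = [v0, v1, v2, v3] */
    __m256d h  = _mm256_hadd_pd(v, v);          /* [v0+v1, v0+v1, v2+v3, v2+v3] */
    __m128d hi = _mm256_extractf128_pd(h, 1);   /* upper half: [v2+v3, v2+v3] */
    __m128d lo = _mm256_castpd256_pd128(h);     /* lower half: [v0+v1, v0+v1] */
    __m128d s  = _mm_add_sd(lo, hi);            /* low element = v0+v1+v2+v3 */
    return _mm_cvtsd_f64(s);                    /* return the low element */
}
```

The arithmetic is still three steps (horizontal add, extract, add); the cast and the final scalar conversion do not usually cost an extra instruction. Passing an array holding 1.5, 2.5, 3.0, 4.0 should return 11.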