Question
Assume I have two vectors, each represented by an array of type double of size 2. I'd like to add corresponding positions. So, given vectors i0 and i1, I'd like to compute i0[0] + i1[0] and i0[1] + i1[1].
Since the type is double, I would need two registers. The trick would be to put i0[0] and i1[0] in one register, and i0[1] and i1[1] in another, and just add the register with itself.
My question is: if I call _mm_load_ps(i0[0]) and then _mm_load_ps(i1[0]), will that place them in the lower and upper 64 bits separately, or will the second load replace the register's contents? How would I place both doubles in the same register so that I can call add_ps afterwards?
Thanks.
Recommended answer
I think what you want is this:
#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_load_pd, _mm_add_pd

double i0[2];
double i1[2];
__m128d x1 = _mm_load_pd(i0);
__m128d x2 = _mm_load_pd(i1);
__m128d sum = _mm_add_pd(x1, x2);
// do whatever you want to with "sum" now
When you do a _mm_load_pd, it puts the first double into the lower 64 bits of the register and the second into the upper 64 bits. So, after the loads above, x1 holds the two double values i0[0] and i0[1] (and similarly for x2). The call to _mm_add_pd vertically adds the corresponding elements in x1 and x2, so after the addition, sum holds i0[0] + i1[0] in its lower 64 bits and i0[1] + i1[1] in its upper 64 bits.
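To make the lane layout concrete, here is a minimal C sketch of the same load, add, and store sequence, assuming an x86 target with SSE2. The helper name add2_pd is hypothetical and not part of the original answer. One assumption worth stating loudly: _mm_load_pd and _mm_store_pd require 16-byte-aligned addresses, so the sketch asserts alignment; _mm_loadu_pd and _mm_storeu_pd are the unaligned counterparts.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_load_pd, _mm_add_pd, _mm_store_pd */
#include <stdint.h>

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0, 1.
   All three pointers must be 16-byte aligned, because
   _mm_load_pd and _mm_store_pd require aligned addresses. */
static void add2_pd(const double *a, const double *b, double *out)
{
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    __m128d x1 = _mm_load_pd(a);       /* low lane = a[0], high lane = a[1] */
    __m128d x2 = _mm_load_pd(b);       /* low lane = b[0], high lane = b[1] */
    __m128d sum = _mm_add_pd(x1, x2);  /* lane-wise addition */
    _mm_store_pd(out, sum);            /* out[0] = low lane, out[1] = high lane */
}
```

A caller could guarantee the alignment by declaring the arrays with the C11 specifier _Alignas(16), for example _Alignas(16) double i0[2] = {1.0, 2.0};.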
I should point out that there is no benefit to using _mm_load_pd instead of _mm_load_ps. As the function names indicate, the pd variety explicitly loads two packed doubles and the ps version loads four packed single-precision floats. Since these are purely bit-for-bit memory moves and they both use the SSE floating-point unit, there is no penalty to using _mm_load_ps to load double data. There is also a benefit to _mm_load_ps: its instruction encoding is one byte shorter than that of _mm_load_pd, so it is more efficient in terms of instruction cache usage (and potentially instruction decoding; I'm not an expert on all of the intricacies of modern x86 processors). The above code using _mm_load_ps would look like:
double i0[2];
double i1[2];
__m128d x1 = (__m128d) _mm_load_ps((float *) i0);
__m128d x2 = (__m128d) _mm_load_ps((float *) i1);
__m128d sum = _mm_add_pd(x1, x2);
// do whatever you want to with "sum" now
The casts do not perform any operation; they simply make the compiler reinterpret the SSE register's contents as holding doubles instead of floats so that it can be passed into the double-precision arithmetic function _mm_add_pd.
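As a variation on the snippet above: the C-style (__m128d) cast depends on how a given compiler treats vector types and may not be accepted everywhere. The intrinsic _mm_castps_pd from emmintrin.h expresses the same reinterpretation and is likewise documented as generating no instruction. The sketch below assumes an SSE2 target and 16-byte-aligned inputs; the helper name add2_via_ps is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also provides the SSE _mm_load_ps declaration */
#include <stdint.h>

/* Hypothetical helper: loads double data with _mm_load_ps, then
   reinterprets each register as two doubles before adding.
   All pointers must be 16-byte aligned. */
static void add2_via_ps(const double *a, const double *b, double *out)
{
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    /* The pointer cast only changes the static type; the same 128 bits are loaded. */
    __m128 f1 = _mm_load_ps((const float *)a);
    __m128 f2 = _mm_load_ps((const float *)b);

    /* _mm_castps_pd reinterprets the register bits; no conversion takes place. */
    __m128d x1 = _mm_castps_pd(f1);
    __m128d x2 = _mm_castps_pd(f2);

    _mm_store_pd(out, _mm_add_pd(x1, x2));
}
```

Because the load is a plain bit copy, this should produce the same sums as the _mm_load_pd version for any aligned input.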