Problem description
elma and elmc are both unsigned long arrays. So are res1 and res2.
unsigned long simdstore[2];
__m128i *p, simda, simdb, simdc;
p = (__m128i *) simdstore;

for (i = 0; i < _polylen; i++)
{
    u1 = (elma[i] >> l) & 15;
    u2 = (elmc[i] >> l) & 15;

    for (k = 0; k < 20; k++)
    {
        //res1[i + k] ^= _mulpre1[u1][k];
        //res2[i + k] ^= _mulpre2[u2][k];

        simda = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);
        simdb = _mm_set_epi64x (res2[i + k], res1[i + k]);
        simdc = _mm_xor_si128 (simda, simdb);
        _mm_store_si128 (p, simdc);
        res1[i + k] = simdstore[0];
        res2[i + k] = simdstore[1];
    }
}
The inner loop contains both the non-SIMD and the SIMD version of the element-wise XOR. The first two (commented-out) lines inside the inner for loop do the explicit scalar XOR, while the rest implement the SIMD version of the same operation.
This loop is called hundreds of times from outside, so optimizing it would help bring down the total computation time.
The problem is that the SIMD code runs many times slower than the scalar code.
Edit: I have done partial unrolling:
__m128i *p1, *p2, *p3, *p4;
p1 = (__m128i *) simdstore1;
p2 = (__m128i *) simdstore2;
p3 = (__m128i *) simdstore3;
p4 = (__m128i *) simdstore4;

for (i = 0; i < 20; i++)
{
    u1 = (elma[i] >> l) & 15;
    u2 = (elmc[i] >> l) & 15;

    for (k = 0; k < 20; k = k + 4)
    {
        simda1 = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);
        simda2 = _mm_set_epi64x (_mulpre2[u2][k + 1], _mulpre1[u1][k + 1]);
        simda3 = _mm_set_epi64x (_mulpre2[u2][k + 2], _mulpre1[u1][k + 2]);
        simda4 = _mm_set_epi64x (_mulpre2[u2][k + 3], _mulpre1[u1][k + 3]);

        simdb1 = _mm_set_epi64x (res2[i + k], res1[i + k]);
        simdb2 = _mm_set_epi64x (res2[i + k + 1], res1[i + k + 1]);
        simdb3 = _mm_set_epi64x (res2[i + k + 2], res1[i + k + 2]);
        simdb4 = _mm_set_epi64x (res2[i + k + 3], res1[i + k + 3]);

        simdc1 = _mm_xor_si128 (simda1, simdb1);
        simdc2 = _mm_xor_si128 (simda2, simdb2);
        simdc3 = _mm_xor_si128 (simda3, simdb3);
        simdc4 = _mm_xor_si128 (simda4, simdb4);

        _mm_store_si128 (p1, simdc1);
        _mm_store_si128 (p2, simdc2);
        _mm_store_si128 (p3, simdc3);
        _mm_store_si128 (p4, simdc4);

        res1[i + k] = simdstore1[0];
        res2[i + k] = simdstore1[1];
        res1[i + k + 1] = simdstore2[0];
        res2[i + k + 1] = simdstore2[1];
        res1[i + k + 2] = simdstore3[0];
        res2[i + k + 2] = simdstore3[1];
        res1[i + k + 3] = simdstore4[0];
        res2[i + k + 3] = simdstore4[1];
    }
}
But the result does not change much; it still takes twice as long as the scalar code.
Accepted answer
Disclaimer: I come from a PowerPC background, so what I'm saying here might be complete hogwash. But you're stalling your vector pipeline, because you try to access your results right away.
It is best to keep everything in your vector pipeline. As soon as you do any kind of conversion from vector to int or float, or store the result into memory, you're stalling.
The best mode of operation when dealing with SSE or VMX is: load, process, store. Load the data into your vector registers, do all the vector processing, then store it back to memory.
I would recommend: reserve several __m128i registers, unroll your loop several times, then store the results.
Edit: Also, if you unroll, and if you align res1 and res2 to 16 bytes, you can store your results directly in memory without going through the simdstore indirection, which is probably a load-hit-store (LHS) and another stall.
Edit: Forgot the obvious. If your polylen is typically large, don't forget to do a data-cache prefetch on every iteration.