Problem description
int i, k, l, u1, u2;
__m128i simda, simdb;
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40];  /* each element is 64 bits long */
/* res1 and res2 are initialized to zero. */
l = 60;
while (l)
{
    for (i = 0; i < 20; i += 2)
    {
        /* Take the 4-bit window at bit position l from two adjacent words. */
        u1 = (elm1[i] >> l) & 15;
        u2 = (elm1[i + 1] >> l) & 15;
        for (k = 0; k < 20; k += 2)
        {
            simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);
            simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);
            simdb = _mm_xor_si128 (simda, simdb);
            _mm_store_si128 ((__m128i *) &res1[i + k], simdb);
            simda = _mm_load_si128 ((__m128i *) &_mulpre[u2][k]);
            simdb = _mm_load_si128 ((__m128i *) &res2[i + k]);
            simdb = _mm_xor_si128 (simda, simdb);
            _mm_store_si128 ((__m128i *) &res2[i + k], simdb);
        }
    }
    l -= 4;
    /* Then all res1, res2 values are left shifted by 4 bits. */
}
The code above is called many times in my program (the profiler shows 98%).
Edit: In the inner loop, the res1[i + k] values are loaded many times for the same (i + k) values. I tried the following: inside the while loop, I loaded all the res1 values into an array of SIMD registers and used those array elements in the innermost for loop to update them. Once both for loops were done, I stored the array values back to res1 and res2. But the computation time increased with this. Any idea where I went wrong? The idea seemed to be correct.
Any suggestion to make it faster is welcome.
Recommended answer
Unfortunately the most obvious optimisations are probably already being done by the compiler:
- You can pull &_mulpre[u1] and &_mulpre[u2] out of the inner loop.
- You can pull &res1[i] out of the inner loop.
- Using different variables for the two inner operations, and reordering them, might allow for better pipelining.
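As a rough sketch of the first and third points, the pass over i and k for one value of l could be written as below. It assumes an x86 target with SSE2, 16-byte-aligned arrays, and uint64_t in place of the question's 64-bit unsigned long. The names pass_original, pass_hoisted and hoisted_matches_original are hypothetical and not from the question; the 4-bit left shift of res1/res2 is left out because it is not shown in the question either. Since the compiler may already perform these transformations, any gain should be measured rather than assumed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

#define WORDS 20

/* One i/k pass for a fixed shift l, written as in the question. */
static void pass_original(const uint64_t *elm1, uint64_t (*mulpre)[WORDS],
                          uint64_t *res1, uint64_t *res2, int l)
{
    for (int i = 0; i < WORDS; i += 2) {
        int u1 = (int)((elm1[i] >> l) & 15);
        int u2 = (int)((elm1[i + 1] >> l) & 15);
        for (int k = 0; k < WORDS; k += 2) {
            __m128i simda = _mm_load_si128((const __m128i *)&mulpre[u1][k]);
            __m128i simdb = _mm_load_si128((const __m128i *)&res1[i + k]);
            _mm_store_si128((__m128i *)&res1[i + k], _mm_xor_si128(simda, simdb));
            simda = _mm_load_si128((const __m128i *)&mulpre[u2][k]);
            simdb = _mm_load_si128((const __m128i *)&res2[i + k]);
            _mm_store_si128((__m128i *)&res2[i + k], _mm_xor_si128(simda, simdb));
        }
    }
}

/* Same pass with the row and output pointers hoisted out of the k loop
   and separate temporaries for the two independent XOR chains. */
static void pass_hoisted(const uint64_t *elm1, uint64_t (*mulpre)[WORDS],
                         uint64_t *res1, uint64_t *res2, int l)
{
    /* Aligned loads/stores require 16-byte alignment (an assumption here). */
    assert(((uintptr_t)mulpre & 15) == 0);
    assert(((uintptr_t)res1 & 15) == 0 && ((uintptr_t)res2 & 15) == 0);

    for (int i = 0; i < WORDS; i += 2) {
        /* None of these four pointers depends on k. */
        const __m128i *row1 = (const __m128i *)mulpre[(elm1[i] >> l) & 15];
        const __m128i *row2 = (const __m128i *)mulpre[(elm1[i + 1] >> l) & 15];
        __m128i *out1 = (__m128i *)&res1[i];
        __m128i *out2 = (__m128i *)&res2[i];

        /* Each __m128i covers two 64-bit words, so k counts vectors here. */
        for (int k = 0; k < WORDS / 2; k++) {
            __m128i a1 = _mm_load_si128(row1 + k);
            __m128i a2 = _mm_load_si128(row2 + k);
            __m128i b1 = _mm_load_si128(out1 + k);
            __m128i b2 = _mm_load_si128(out2 + k);
            _mm_store_si128(out1 + k, _mm_xor_si128(a1, b1));
            _mm_store_si128(out2 + k, _mm_xor_si128(a2, b2));
        }
    }
}

/* Small xorshift generator, used only to fill the test data. */
static uint64_t next_rand(uint64_t *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    return *s;
}

/* Returns 1 if both versions produce identical res1/res2 on pseudo-random data. */
int hoisted_matches_original(void)
{
    static _Alignas(16) uint64_t elm1[WORDS], mulpre[16][WORDS];
    static _Alignas(16) uint64_t r1a[2 * WORDS], r2a[2 * WORDS];
    static _Alignas(16) uint64_t r1b[2 * WORDS], r2b[2 * WORDS];
    uint64_t seed = 0x9E3779B97F4A7C15ULL;

    for (int i = 0; i < WORDS; i++) elm1[i] = next_rand(&seed);
    for (int u = 0; u < 16; u++)
        for (int k = 0; k < WORDS; k++) mulpre[u][k] = next_rand(&seed);
    memset(r1a, 0, sizeof r1a); memset(r2a, 0, sizeof r2a);
    memset(r1b, 0, sizeof r1b); memset(r2b, 0, sizeof r2b);

    for (int l = 60; l; l -= 4) {
        pass_original(elm1, mulpre, r1a, r2a, l);
        pass_hoisted(elm1, mulpre, r1b, r2b, l);
    }
    return memcmp(r1a, r1b, sizeof r1a) == 0 && memcmp(r2a, r2b, sizeof r2a) == 0;
}
```

Calling hoisted_matches_original() is only a sanity check that the rewritten loop computes the same XOR accumulation; it says nothing about speed, which depends on the compiler and CPU.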
Possibly swapping the outer loops would improve cache locality on elm1.