Problem description
Given the arrays:
int canvas[10][10];
int addon[10][10];
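If all the values range from 0 to 100, what would be the fastest way in C++ to add these two arrays, so that each cell in canvas equals itself plus the corresponding cell value in addon?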
I.e., I want to achieve something like:
canvas += addon;
So if canvas[0][0] = 3 and addon[0][0] = 2, then canvas[0][0] = 5.
Speed is essential here, as I am writing a very simple program to brute-force a knapsack-type problem, and there will be tens of millions of combinations.
And as a small extra question (thanks if you can help!): what would be the fastest way of checking whether any of the values in canvas exceed 100? Loops are slow!
Recommended answer
Here is an SSE4 implementation that should perform pretty well on Nehalem (Core i7):
#include <limits.h>
#include <emmintrin.h>   // SSE2
#include <smmintrin.h>   // SSE4.1

// Adds addon to canvas in place, four ints at a time.
// Returns 1 if any resulting value in canvas exceeds 100, otherwise 0.
static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (__m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);
        vc = _mm_add_epi32(vc, va);
        vmax = _mm_max_epi32(vmax, vc);   // SSE4 *
        _mm_storeu_si128(cp, vc);
        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit); // SSE2
    cmp = _mm_testz_si128(vcmp, vcmp);    // SSE4 *
    return cmp == 0;
}
Compile with gcc -msse4.1 ... or equivalent for your particular development environment.
For older CPUs without SSE4 (and with much more expensive misaligned loads/stores), you'll need to (a) use a suitable combination of SSE2/SSE3 intrinsics to replace the SSE4 operations (marked with an * above) and ideally (b) make sure your data is 16-byte aligned and use aligned loads/stores (_mm_load_si128/_mm_store_si128) in place of _mm_loadu_si128/_mm_storeu_si128.
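As a rough illustration of points (a) and (b), here is one possible SSE2-only sketch. It is not the answer's own code: the name canvas_add_sse2 is hypothetical, and rather than emulating _mm_max_epi32 it swaps in a different technique, OR-ing together the per-iteration compare masks and reading the result with _mm_movemask_epi8. It assumes both arrays are declared with 16-byte alignment (for example with alignas(16)).

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 only

// Hypothetical SSE2-only variant (not from the original answer).
// Assumption: canvas and addon are both 16-byte aligned.
// Adds addon to canvas in place; returns true if any result exceeds 100.
static inline bool canvas_add_sse2(int canvas[10][10], int addon[10][10])
{
    // Aligned loads/stores fault on misaligned addresses, so check up front.
    assert(reinterpret_cast<std::uintptr_t>(canvas) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(addon) % 16 == 0);

    __m128i* cp = reinterpret_cast<__m128i*>(&canvas[0][0]);
    const __m128i* ap = reinterpret_cast<const __m128i*>(&addon[0][0]);
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vover = _mm_setzero_si128();  // accumulated "value > 100" mask

    for (int i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_load_si128(cp);        // aligned load
        const __m128i va = _mm_load_si128(ap);  // aligned load
        vc = _mm_add_epi32(vc, va);
        // Lanes that exceed the limit become all ones; OR them into the mask.
        vover = _mm_or_si128(vover, _mm_cmpgt_epi32(vc, vlimit));
        _mm_store_si128(cp, vc);                // aligned store
        ++cp;
        ++ap;
    }
    // A nonzero byte mask means at least one lane exceeded the limit.
    return _mm_movemask_epi8(vover) != 0;
}
```

Because SSE2 is part of the x86-64 baseline, this variant should build for 64-bit targets without -msse4.1. Whether it is faster or slower than the SSE4 version on a given CPU would need to be measured.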