int canvas[10][10];
int addon[10][10];

I.e., I want to achieve something like:

canvas += addon;

So if canvas[0][0] = 3 and addon[0][0] = 2, then canvas[0][0] becomes 5.


Speed is essential here as I am writing a very simple program to brute force a knapsack type problem and there will be tens of millions of combinations.


And as a small extra question (thanks if you can help!) what would be the fastest way of checking if any of the values in canvas exceed 100? Loops are slow!



Here is an SSE4.1 implementation that should perform pretty well on Nehalem (Core i7):

#include <limits.h>
#include <emmintrin.h>
#include <smmintrin.h>

// Adds addon to canvas element-wise (in place).
// Returns non-zero if any resulting value in canvas exceeds 100.
static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (const __m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)      // 4 ints per 128-bit vector
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);

        vc = _mm_add_epi32(vc, va);
        vmax = _mm_max_epi32(vmax, vc);   // SSE4 *

        _mm_storeu_si128(cp, vc);

        cp++;                             // advance to the next 4 ints
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit); // SSE2
    cmp = _mm_testz_si128(vcmp, vcmp);    // SSE4 *
    return cmp == 0;
}

Compile with gcc -msse4.1 ... or equivalent for your particular development environment.

For older CPUs without SSE4 (and with much more expensive misaligned loads/stores) you'll need to (a) use a suitable combination of SSE2/SSE3 intrinsics to replace the SSE4 operations (marked with an * above) and ideally (b) make sure your data is 16-byte aligned and use aligned loads/stores (_mm_load_si128/_mm_store_si128) in place of _mm_loadu_si128/_mm_storeu_si128.
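As a rough sketch of what such a fallback could look like, here is an SSE2-only variant. It is untuned and was not part of the original code. The name canvas_add_sse2 is made up for illustration. Rather than tracking a running maximum, it ORs together the per-vector comparison masks and tests the result with _mm_movemask_epi8, which avoids both SSE4.1 operations. It assumes both arrays are 16-byte aligned; since 100 ints is exactly 25 vectors, every load and store is then aligned.

```c
#include <emmintrin.h>  /* SSE2 only */

/* Hypothetical SSE2-only variant (illustrative name).
 * Assumption: canvas and addon are both 16-byte aligned.
 * Adds addon to canvas in place; returns 1 if any result exceeds 100, else 0. */
static inline int canvas_add_sse2(int canvas[10][10], int addon[10][10])
{
    __m128i *cp = (__m128i *)&canvas[0][0];
    const __m128i *ap = (const __m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vover = _mm_setzero_si128();  /* accumulated "> 100" lane masks */
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        /* aligned loads, 4 ints at a time */
        __m128i vc = _mm_add_epi32(_mm_load_si128(cp), _mm_load_si128(ap));

        /* lanes greater than 100 become all ones; OR them into the accumulator */
        vover = _mm_or_si128(vover, _mm_cmpgt_epi32(vc, vlimit));

        _mm_store_si128(cp, vc);          /* aligned store */
        cp++;
        ap++;
    }
    /* non-zero byte mask means at least one lane exceeded the limit */
    return _mm_movemask_epi8(vover) != 0;
}
```

To satisfy the alignment assumption, declare the arrays with a 16-byte alignment specifier, for instance _Alignas(16) in C11 or __attribute__((aligned(16))) with gcc. If you can't guarantee alignment, swap the loads and stores back to the unaligned versions.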

