I have two unsigned vectors, both of size 4:

vector<unsigned> v1 = {2, 4, 6, 8};
vector<unsigned> v2 = {1, 10, 11, 13};

Now I want to multiply these two vectors element by element and get a new one:

vector<unsigned> v_result = {2*1, 4*10, 6*11, 8*13};

What is the SSE operation to use? Is it cross-platform, or only available on some specific platforms?

Adding: If my goal is addition rather than multiplication, I can do this super fast:

__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
__m128i c;
c = _mm_add_epi32(a,b);

Using the set intrinsics such as _mm_set_epi32 for all elements is inefficient. It's better to use the load intrinsics. See this discussion for more on that: "Where does the SSE instructions outperform normal instructions".

If the arrays are 16-byte aligned, you can use either _mm_load_si128 or _mm_loadu_si128 (for aligned memory they have nearly the same efficiency). Otherwise you must use _mm_loadu_si128, because _mm_load_si128 requires a 16-byte-aligned address. Either way, accessing aligned memory is more efficient than accessing unaligned memory, so it's worth aligning your arrays. To get aligned memory I recommend _mm_malloc and _mm_free, or C11 aligned_alloc so you can use normal free.
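As a rough sketch of the load/store approach applied to the addition from the question: the helper name add_vectors is made up for illustration, and the code assumes unsigned is 32 bits wide. It uses the unaligned load and store intrinsics because std::vector does not guarantee 16-byte alignment.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <vector>

static_assert(sizeof(unsigned) == 4, "this sketch assumes 32-bit unsigned");

// Hypothetical helper: element-wise sum of two equally sized vectors.
std::vector<unsigned> add_vectors(const std::vector<unsigned>& x,
                                  const std::vector<unsigned>& y) {
    assert(x.size() == y.size());
    std::vector<unsigned> out(x.size());
    std::size_t i = 0;
    // Main loop: four 32-bit elements per iteration.
    for (; i + 4 <= x.size(); i += 4) {
        // Unaligned loads, since the vector's buffer may not be 16-byte aligned.
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&x[i]));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&y[i]));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&out[i]),
                         _mm_add_epi32(a, b));
    }
    // Scalar tail for sizes that are not a multiple of 4.
    for (; i < x.size(); ++i) {
        out[i] = x[i] + y[i];
    }
    return out;
}
```

If you allocate the buffers yourself with _mm_malloc or aligned_alloc, you could swap in _mm_load_si128 and _mm_store_si128 here.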

To answer the rest of your question, let's assume you have your two vectors loaded in SSE registers __m128i a and __m128i b.

For SSE4.1 and later, use:

_mm_mullo_epi32(a, b);

Without SSE4.1:

This code is copied from Agner Fog's Vector Class Library (and was plagiarized by the original author of this answer):

// Vec4i operator * (Vec4i const & a, Vec4i const & b) {
// #ifdef
__m128i a13    = _mm_shuffle_epi32(a, 0xF5);          // (-,a3,-,a1)
__m128i b13    = _mm_shuffle_epi32(b, 0xF5);          // (-,b3,-,b1)
__m128i prod02 = _mm_mul_epu32(a, b);                 // (-,a2*b2,-,a0*b0)
__m128i prod13 = _mm_mul_epu32(a13, b13);             // (-,a3*b3,-,a1*b1)
__m128i prod01 = _mm_unpacklo_epi32(prod02,prod13);   // (-,-,a1*b1,a0*b0)
__m128i prod23 = _mm_unpackhi_epi32(prod02,prod13);   // (-,-,a3*b3,a2*b2)
__m128i prod   = _mm_unpacklo_epi64(prod01,prod23);   // (ab3,ab2,ab1,ab0)
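Putting the two variants together, a sketch might look like the following. The names mul_lo_u32 and multiply_vectors are hypothetical, the code assumes unsigned is 32 bits, and the __SSE4_1__ check relies on the macro GCC and Clang define when SSE4.1 is enabled (as far as I know MSVC does not define it, so the SSE2 path would be used there).

```cpp
#include <emmintrin.h>   // SSE2
#if defined(__SSE4_1__)
#include <smmintrin.h>   // SSE4.1: _mm_mullo_epi32
#endif
#include <cassert>
#include <cstddef>
#include <vector>

static_assert(sizeof(unsigned) == 4, "this sketch assumes 32-bit unsigned");

// Hypothetical helper: low 32 bits of each of the four lane products.
static inline __m128i mul_lo_u32(__m128i a, __m128i b) {
#if defined(__SSE4_1__)
    return _mm_mullo_epi32(a, b);
#else
    // SSE2 fallback, same sequence as shown above.
    __m128i a13    = _mm_shuffle_epi32(a, 0xF5);         // (-,a3,-,a1)
    __m128i b13    = _mm_shuffle_epi32(b, 0xF5);         // (-,b3,-,b1)
    __m128i prod02 = _mm_mul_epu32(a, b);                // (-,a2*b2,-,a0*b0)
    __m128i prod13 = _mm_mul_epu32(a13, b13);            // (-,a3*b3,-,a1*b1)
    __m128i prod01 = _mm_unpacklo_epi32(prod02, prod13); // (-,-,a1*b1,a0*b0)
    __m128i prod23 = _mm_unpackhi_epi32(prod02, prod13); // (-,-,a3*b3,a2*b2)
    return _mm_unpacklo_epi64(prod01, prod23);           // (ab3,ab2,ab1,ab0)
#endif
}

// Hypothetical helper: element-wise product (mod 2^32) of two vectors.
std::vector<unsigned> multiply_vectors(const std::vector<unsigned>& x,
                                       const std::vector<unsigned>& y) {
    assert(x.size() == y.size());
    std::vector<unsigned> out(x.size());
    std::size_t i = 0;
    for (; i + 4 <= x.size(); i += 4) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&x[i]));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&y[i]));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&out[i]), mul_lo_u32(a, b));
    }
    // Scalar tail for sizes that are not a multiple of 4.
    for (; i < x.size(); ++i) {
        out[i] = x[i] * y[i];
    }
    return out;
}
```

With the vectors from the question, multiply_vectors(v1, v2) gives {2, 40, 66, 104}. Products that don't fit in 32 bits wrap around, the same as ordinary unsigned multiplication.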

