This article looks at scaling byte pixel values (y = ax + b) using SSE2 (as floats); hopefully it is a useful reference for anyone facing the same problem.

Problem description

I want to calculate y = ax + b, where x and y are pixel values (i.e. bytes with a value range of 0~255), while a and b are floats.

Since I need to apply this formula to each pixel in an image, and a and b are different for different pixels, direct calculation in C++ is slow, so I am interested in learning the SSE2 instructions in C++.

After searching, I found that multiplication and addition of floats with SSE2 are just _mm_mul_ps and _mm_add_ps. But first I need to convert x from byte to float (4 bytes).

The question is: after I load the data from the byte data source (_mm_load_si128), how can I convert the data from byte to float?

Solution

a and b are different for each pixel? That's going to make it difficult to vectorize, unless there's a pattern, or you can generate them in vectors.

Is there any way you can efficiently generate a and b in vectors, either as fixed-point or floating point? If not, inserting 4 FP values, or 8 16bit integers, might be worse than just scalar ops.


Fixed point

If a and b can be reused at all, or generated with fixed-point in the first place, this might be a good use-case for fixed-point math (i.e. integers that represent value * 2^scale). SSE/AVX don't have an 8b*8b->16b multiply; the smallest elements are words, so you have to unpack bytes to words, but not all the way to 32bit. This means you can process twice as much data per instruction.

There's a _mm_maddubs_epi16 instruction which might be useful if b and a change infrequently enough, or if you can easily generate a vector with alternating a*2^4 and b*2^1 bytes. Apparently it's really handy for bilinear interpolation, but it still gets the job done for us with minimal shuffling, if we can prepare an a and b vector.
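
As a reference (not part of the original answer), here is a scalar model of what one 16-bit lane of the pmaddubsw loop below computes, assuming the same logascale=4 / logbscale=1 choices; scale1_fixed8 is a made-up name, and the clamps mimic pmaddubsw's 16-bit saturation and packuswb's byte saturation:

#include <stdint.h>

// scalar reference for one pixel, mirroring the per-lane math of the SIMD loop below
static unsigned char scale1_fixed8(unsigned char x, float a, float b)
{
    int a_fx = (int8_t)(a * 16);           // a in Q3.4 (4 fraction bits); must fit in a signed byte
    int b_fx = (int8_t)(b * 2);            // b in Q6.1 (1 fraction bit)
    int t = x*a_fx + 8*b_fx;               // what pmaddubsw produces per lane (8 = brescale byte)
    if (t >  32767) t =  32767;            // pmaddubsw saturates the 16-bit sum
    if (t < -32768) t = -32768;
    t >>= 4;                               // arithmetic shift here; the SIMD code uses a logical one
    return (unsigned char)(t < 0 ? 0 : (t > 255 ? 255 : t));   // packuswb-style saturation to [0..255]
}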

float a, b;
const int logascale = 4, logbscale=1;
const int ascale = 1<<logascale;  // fixed point scale for a: 2^4
const int bscale = 1<<logbscale;  // fixed point scale for b: 2^1

const __m128i brescale = _mm_set1_epi8(1<<(logascale-logbscale));  // re-scale b to match a in the 16bit temporary result

for (i=0 ; i<n; i+=16) {
    //__m128i avec = get_scaled_a(i);
    //__m128i bvec = get_scaled_b(i);
    //__m128i ab_lo = _mm_unpacklo_epi8(avec, bvec);
    //__m128i ab_hi = _mm_unpackhi_epi8(avec, bvec);

    __m128i abvec = _mm_set1_epi16( ((int8_t)(bscale*b) << 8) | (int8_t)(ascale*a) );  // integer promotion rules might do sign-extension in the wrong place here, so check this if you actually write it this way.

    __m128i block = _mm_load_si128((const __m128i*)&buf[i]);  // call this { v[0] .. v[15] }

    __m128i lo = _mm_unpacklo_epi8(block, brescale);  // {v[0], 8, v[1], 8, ...}
    __m128i hi = _mm_unpackhi_epi8(block, brescale);  // {v[8], 8, v[9], 8, ...}
    lo = _mm_maddubs_epi16(lo, abvec);  // first arg is unsigned bytes, 2nd arg is signed bytes
    hi = _mm_maddubs_epi16(hi, abvec);
    // lo = { v[0]*(2^4*a) + 8*(2^1*b), ... }

    lo = _mm_srli_epi16(lo, logascale);  // truncate from scaled fixed-point to integer
    hi = _mm_srli_epi16(hi, logascale);

    // and re-pack.  Logical, not arithmetic right shift means sign bits can't be set
    block = _mm_packuswb(lo, hi);
    _mm_store_si128((__m128i*)&buf[i], block);
}
// then a scalar cleanup loop

2^4 is an arbitrary choice. It leaves 3 non-sign bits for the integer part of a, and 4 fraction bits. So it effectively rounds a to the nearest 16th, and overflows if its magnitude is greater than 7 and 15/16ths. 2^6 would give more fractional bits, and allow a from -2 to +1 and 63/64ths.
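
For example, with the 2^4 scale, a = 3.3 is stored as round(3.3 * 16) = 53, which represents 53/16 = 3.3125; a = 9.0 would need 144, which no longer fits in a signed byte.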

Since b is being added, not multiplied, its useful range is much larger, and fractional part much less useful. To represent it in 8 bits, rounding it to the nearest half still keeps a little bit of fractional information, but allows it to be [-64 : 63.5] without overflowing.

For more precision, 16b fixed-point is a good choice. You can scale a and b up by 2^7 or something, to have 7b of fractional precision and still allow the integer part to be [-256 .. 255]. There's no multiply-and-add instruction for this case, so you'd have to do that separately. Good options for doing the multiply include:

  • _mm_mulhi_epu16: unsigned 16b*16b->high16 (bits [31:16]). Useful if a can't be negative
  • _mm_mulhi_epi16: signed 16b*16b->high16 (bits [31:16]).
  • _mm_mulhrs_epi16: signed 16b*16b->bits [30:15] of the 32b temporary, with rounding. With a good choice of scaling factor for a, this should be nicer. As I understand it, SSSE3 introduced this instruction for exactly this kind of use.
  • _mm_mullo_epi16: signed 16b*16b->low16 (bits [15:0]). This only allows 8 significant bits for a before the low16 result overflows, so I think all you gain over the _mm_maddubs_epi16 8bit solution is more precision for b.

To use these, you'd get scaled 16b vectors of a and b values, then (see the sketch after this list):

  • unpack your bytes with zero (or pmovzx byte->word), to get signed words still in the [0..255] range
  • left shift the words by 7.
  • multiply by your a vector of 16b words, taking the upper half of each 16*16->32 result (e.g. with one of the mulhi variants listed above)
  • right shift here if you wanted different scales for a and b, to get more fractional precision for a
  • add b to that.
  • right shift to do the final truncation back from fixed point to [0..255].

With a good choice of fixed-point scale, this should be able to handle a wider range of a and b, as well as more fractional precision, than 8bit fixed point.
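
As one concrete way to wire up the steps listed above (the scale factors and helper name here are assumptions chosen for illustration): store a as Q3.12 (so |a| < 8) and b as Q11.4, and use pmulhrsw for the high-half multiply.

#include <immintrin.h>  // SSE4.1 for pmovzxbw, SSSE3 for pmulhrsw

// y = a*x + b for 8 pixels at a time, in 16-bit fixed point.
// a_q12 holds round(a * 2^12) per word, b_q4 holds round(b * 2^4) per word.
static inline __m128i scale8_fixed16(__m128i eight_pixel_bytes, __m128i a_q12, __m128i b_q4)
{
    __m128i x  = _mm_cvtepu8_epi16(eight_pixel_bytes);  // zero-extend 8 bytes to words, [0..255]
    x = _mm_slli_epi16(x, 7);                           // x * 2^7, still fits in a signed word
    __m128i ax = _mm_mulhrs_epi16(x, a_q12);            // round((x*2^7 * a*2^12) / 2^15) == a*x in Q.4
    __m128i y  = _mm_adds_epi16(ax, b_q4);              // + b in Q.4, with signed saturation
    return _mm_srai_epi16(y, 4);                        // back to integer pixel values (still words)
}
// pack two such results with _mm_packus_epi16 to saturate into [0..255] bytes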

If you don't left-shift your bytes after unpacking them to words, a has to be full-range just to get 8bits set in the high16 of the result. This would mean a very limited range of a that you could support without truncating your temporary to less than 8 bits during the multiply. Even _mm_mulhrs_epi16 doesn't leave much room, since it starts at bit 30.


expand bytes to floats

If you can't efficiently generate fixed-point a and b values for every pixel, it may be best to convert your pixels to floats. This takes more unpacking/repacking, so latency and throughput are worse. It's worth looking into generating a and b with fixed point.

For packed-float to work, you still have to efficiently build a vector of a values for 4 adjacent pixels.
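
Purely as an assumption about how a and b might be stored (the question doesn't say): if they live in float arrays parallel to the pixel buffer, the get_a / get_b helpers used in the loop below could just be unaligned loads.

// hypothetical storage: one float a and one float b per pixel, in plain parallel arrays
extern const float *a_arr, *b_arr;
static inline __m128 get_a(int i) { return _mm_loadu_ps(a_arr + i); }  // a for pixels i..i+3
static inline __m128 get_b(int i) { return _mm_loadu_ps(b_arr + i); }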

This is a good use-case for pmovzx (SSE4.1), because it can go directly from 8b elements to 32b. The other options are SSE2 punpck[l/h]bw/punpck[l/h]wd with multiple steps, or SSSE3 pshufb to emulate pmovzx. (You can do one 16B load and shuffle it 4 different ways to unpack it to four vectors of 32b ints.)

char *buf;

// const __m128i zero = _mm_setzero_si128();
for (i=0 ; i<n; i+=16) {
    __m128 a = get_a(i);
    __m128 b = get_b(i);

    // IDK why there isn't an intrinsic for using `pmovzx` as a load, because it takes a m32 or m64 operand, not m128.  (unlike punpck*)
    __m128i unsigned_dwords = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(const int*)(buf+i)));  // movd load of 4B at once, then zero-extend to dwords
    __m128 floats = _mm_cvtepi32_ps(unsigned_dwords);

    floats = _mm_fmadd_ps(floats, a, b);  // with FMA available, this might as well be 256b vectors, even with the inconvenience of the different lane-crossing semantics of pmovzx vs. punpck
    // or without FMA, do this with _mm_mul_ps and _mm_add_ps
    unsigned_dwords = _mm_cvtps_epi32(floats);

    // repeat 3 more times for buf+i+4, buf+i+8, and buf+i+12 to produce dwords0..dwords3, then:

    __m128i packed01 = _mm_packss_epi32(dwords0, dwords1);  // SSE2
    __m128i packed23 = _mm_packss_epi32(dwords2, dwords3);
    // packuswb wants SIGNED input, so do signed saturation on the first step

    // saturate into [0..255] range
    __m128i packedbytes = _mm_packus_epi16(packed01, packed23);  // SSE2
    _mm_store_si128((__m128i*)(buf+i), packedbytes);  // or storeu if buf isn't aligned.
}

// cleanup code to handle the odd up-to-15 leftover bytes, if n%16 != 0
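
One possible shape for that cleanup (a sketch; a_at() / b_at() are hypothetical per-pixel accessors standing in for however a and b are actually produced):

// needs <math.h> for lrintf
for ( ; i < n; ++i) {
    float y = a_at(i) * (unsigned char)buf[i] + b_at(i);
    if (y < 0.0f)   y = 0.0f;      // saturate the same way the vector packs do
    if (y > 255.0f) y = 255.0f;
    buf[i] = (char)lrintf(y);      // round to nearest, matching cvtps2dq's default rounding mode
}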

The previous version of this answer went from float->uint8 vectors with packusdw/packuswb, and had a whole section on workarounds for without SSE4.1. None of that masking-the-sign-bit after an unsigned pack is needed if you simply stay in the signed integer domain until the last pack. I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte. packusdw is only useful if your final goal is uint16_t, rather than further packing.


The last CPUs without SSE4.1 were Intel Conroe/Merom (first-gen Core 2, from before late 2007) and AMD pre-Barcelona (before late 2007). If working-but-slow is acceptable for those CPUs, just write a version for AVX2 and a version for SSE4.1, or SSSE3 (with 4x pshufb to emulate pmovzxbd of the four 32b elements of a register). pshufb is slow on Conroe, though, so if you care about CPUs without SSE4.1, write a specific version for them. Actually, Conroe/Merom also has slow xmm punpcklbw and so on (except for q->dq), so 4x slow pshufb should still beat 6x slow unpacks. Vectorizing is much less of a win on pre-Wolfdale, because of the slow shuffles for unpacking and repacking. The fixed-point version, with a lot less unpacking/repacking, will have an even bigger advantage there.
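
For reference, the 4x-pshufb emulation of pmovzxbd mentioned above could look roughly like this; the shuffle-control constants are my own, and any control byte with its high bit set (here -1) makes pshufb write a zero, which is what provides the zero-extension:

#include <tmmintrin.h>  // SSSE3

// Expand one quarter (4 bytes) of a 16-byte vector into four zero-extended dwords.
static inline __m128i pshufb_pmovzxbd(__m128i bytes, int quarter /* 0..3 */)
{
    static const signed char ctrl[4][16] = {
        {  0,-1,-1,-1,  1,-1,-1,-1,  2,-1,-1,-1,  3,-1,-1,-1 },
        {  4,-1,-1,-1,  5,-1,-1,-1,  6,-1,-1,-1,  7,-1,-1,-1 },
        {  8,-1,-1,-1,  9,-1,-1,-1, 10,-1,-1,-1, 11,-1,-1,-1 },
        { 12,-1,-1,-1, 13,-1,-1,-1, 14,-1,-1,-1, 15,-1,-1,-1 },
    };
    __m128i mask = _mm_loadu_si128((const __m128i *)ctrl[quarter]);  // -1 bytes produce zeros
    return _mm_shuffle_epi8(bytes, mask);
}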

See the edit history for an unfinished attempt at using punpck before I realized how many extra instructions it was going to need. Removed it because this answer is long already, and another code block would be confusing.

This concludes this look at scaling byte pixel values (y = ax + b) using SSE2 (as floats); hopefully the answer above is helpful.
