如何将32位浮点数转换为8位带符号字符

如何将32位浮点数转换为8位带符号字符

本文介绍了如何将32位浮点数转换为8位带符号字符?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想做的是:

  1. 将输入浮点数乘以固定因子.
  2. 将它们转换为8位带符号的字符.

请注意,大多数输入都具有较小的绝对值范围,例如[-6、6],以便固定因子可以将它们映射到[-127、127].

我仅处理avx2指令集,因此无法使用像_mm256_cvtepi32_epi8这样的内在函数.我想使用_mm256_packs_epi16,但是它将两个输入混合在一起. :(

我还写了一些将32位浮点数转换为16位int的代码,它的工作原理与我想要的完全一样.

void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
  // input is a matrix actuaaly, num_rows and width represent the number of rows and columns of the matrix
  assert(width % 16 == 0);

  int num_input_chunks = width / 16;

  __m256 avx2_quant_mult = _mm256_set_ps(quant_mult, quant_mult, quant_mult, quant_mult,
                                     quant_mult, quant_mult, quant_mult, quant_mult);

  for (int i = 0; i < num_rows; ++i) {
    const float* input_row = input + i * width;
    __m256i* output_row = output + i * num_input_chunks;
    for (int j = 0; j < num_input_chunks; ++j) {
      const float* x = input_row + j * 16;
      // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.

      __m256 f_0 = _mm256_loadu_ps(x);
      __m256 f_1 = _mm256_loadu_ps(x + 8);

      __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
      __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);

      __m256i i_0 = _mm256_cvtps_epi32(m_0);
      __m256i i_1 = _mm256_cvtps_epi32(m_1);

      *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
    }
  }
}

欢迎任何帮助,非常感谢!

解决方案

对于具有多个源矢量的良好吞吐量,具有2个输入矢量而不是产生较窄的输出是一件好事 . (AVX512 _mm256_cvtepi32_epi8不一定是最有效的处理方式,因为具有存储目标的版本会解码为多个uops,或者常规版本会为您提供多个小输出,需要分别存储.)

还是您在抱怨它的车内运行方式?是的,这很烦人,但是_mm256_packs_epi32做同样的事情.如果您的输出中可以有交错的数据组,也可以这样做.

您最好的选择是将2个矢量降低到1,并分两步进行车道内打包(因为没有交叉通道包).然后使用一个过马路的随机播放器对其进行修复.

#include <immintrin.h>
// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float*p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p+8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p+16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p+24));
    __m256i ab = _mm256_packs_epi32(a,b);        // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c,d);
    __m256i abcd = _mm256_packs_epi16(ab, cd);   // 32x int8_t
    // packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_ b_ c_ d_hi ] order
    // if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done

    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    return lanefix;
}

(在Godbolt编译器资源管理器上可以很好地编译).

循环调用此函数,然后_mm256_store_si256生成矢量.


(对于uint8_t未签名的目标,对于16-> 8步使用_mm256_packus_epi16并保持其他所有内容不变.我们仍然使用32-> 16签名包装,因为16-> u8 vpackuswb包装仍然将其epi16 输入视为带符号.您需要将-1视为-1而不是+0xFFFF,以使无符号饱和将其钳位为0.)


每个256位存储共有4个shuffle,每个时钟吞吐量1个shuffle将成为Intel CPU的瓶颈.您应该获得每个时钟一个浮点矢量的吞吐量,这是端口5上的瓶颈. ( https://agner.org/optimize/).或者,如果L2中的数据不是很热,那么可能会成为内存带宽的瓶颈.


如果只需要一个单个向量,则可以考虑使用_mm256_shuffle_epi8将每个epi32元素的低字节放入每个通道的低32位,然后 _mm256_permutevar8x32_epi32 来进行车道交叉. /p>

另一个单向量替代方法(对Ryzen有利)是extracti128 + 128位packssdw + packsswb.但这仅在您仅执行单个向量的情况下仍然很好. (仍然使用Ryzen,您将需要使用128位向量来避免额外的通道交叉改组,因为Ryzen将每条256位指令分成(至少)2个128位uops.)

相关:

What I want to do is:

  1. Multiply the input floating point number by a fixed factor.
  2. Convert them to 8-bit signed char.

Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].

I work on avx2 instruction set only, so intrinsics function like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16 but it mixes two inputs together. :(

I also wrote some code that converts 32-bit float to 16-bit int, and it works as exactly what I want.

void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
  // input is a matrix actuaaly, num_rows and width represent the number of rows and columns of the matrix
  assert(width % 16 == 0);

  int num_input_chunks = width / 16;

  __m256 avx2_quant_mult = _mm256_set_ps(quant_mult, quant_mult, quant_mult, quant_mult,
                                     quant_mult, quant_mult, quant_mult, quant_mult);

  for (int i = 0; i < num_rows; ++i) {
    const float* input_row = input + i * width;
    __m256i* output_row = output + i * num_input_chunks;
    for (int j = 0; j < num_input_chunks; ++j) {
      const float* x = input_row + j * 16;
      // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.

      __m256 f_0 = _mm256_loadu_ps(x);
      __m256 f_1 = _mm256_loadu_ps(x + 8);

      __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
      __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);

      __m256i i_0 = _mm256_cvtps_epi32(m_0);
      __m256i i_1 = _mm256_cvtps_epi32(m_1);

      *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
    }
  }
}

Any help is welcome, thank you so much!

解决方案

For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)

Or are you complaining about how it operates in-lane? Yes that's annoying, but _mm256_packs_epi32 does the same thing. If it's ok for your outputs to have interleaved groups of data there, do the same thing for this, too.

Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.

#include <immintrin.h>
// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float*p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p+8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p+16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p+24));
    __m256i ab = _mm256_packs_epi32(a,b);        // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c,d);
    __m256i abcd = _mm256_packs_epi16(ab, cd);   // 32x int8_t
    // packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
    // if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done

    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    return lanefix;
}

(Compiles nicely on the Godbolt compiler explorer).

Call this in a loop and _mm256_store_si256 the resulting vector.


(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)


With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.


If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.

Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Still on Ryzen, you'll want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)

Related:

这篇关于如何将32位浮点数转换为8位带符号字符?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

07-30 15:30