This article describes a fast way to copy every second byte to a new memory area, which may be a useful reference if you are solving the same problem.

Problem description

I need a fast way to copy every second byte to a new malloc'd memory area. I have a raw image with RGB data and 16 bits per channel (48 bits per pixel), and I want to create an RGB image with 8 bits per channel (24 bits per pixel).

Is there a faster method than copying byte by byte? I don't know much about SSE2, but I suppose this is possible with SSE/SSE2.

Answer

Your RGB data is packed, so we don't actually have to care about pixel boundaries. The problem is just packing every other byte of an array. (At least within each row of your image; if you use a row stride of 16 or 32B, the padding might not be a whole number of pixels.)

This can be done efficiently with SSE2, AVX, or AVX2 shuffles. (Also with AVX512BW, and maybe even better with AVX512VBMI, but the first AVX512VBMI CPUs probably won't have a very efficient vpermt2b, a 2-input lane-crossing byte shuffle.)

You could use SSSE3 pshufb to grab the bytes you want, but it's only a 1-input shuffle that gives you 8 bytes of output. Storing 8 bytes at a time takes more total store instructions than storing 16 bytes at a time. (You'd also bottleneck on shuffle throughput on Intel CPUs since Haswell, which have only one shuffle port and thus one-per-clock shuffle throughput.) You could also consider 2x pshufb + por to feed a 16B store, and that could be good on Ryzen: use two different shuffle-control vectors, one that puts its result in the low 64b and one that puts it in the high 64b (see Convert 8 16 bit SSE register to 8bit data).
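
As a rough sketch of that 2x pshufb + por idea (the helper name and the shuffle-control constants are mine, not from the original answer; it assumes little-endian 16-bit data, so the wanted high bytes sit at odd offsets):

#include <immintrin.h>

// Sketch: build one 16B store from two pshufb results OR'd together.
// Control bytes with the high bit set (-1 here) zero that output byte.
static inline __m128i pack_high8_pshufb(__m128i a, __m128i b) {
    const __m128i shuf_lo = _mm_setr_epi8(1, 3, 5, 7, 9, 11, 13, 15,
                                          -1, -1, -1, -1, -1, -1, -1, -1);
    const __m128i shuf_hi = _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                          1, 3, 5, 7, 9, 11, 13, 15);
    __m128i lo = _mm_shuffle_epi8(a, shuf_lo);  // odd bytes of a -> low 8 bytes
    __m128i hi = _mm_shuffle_epi8(b, shuf_hi);  // odd bytes of b -> high 8 bytes
    return _mm_or_si128(lo, hi);                // one full 16B vector to store
}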

Instead, it's probably a win to use _mm_packus_epi16 (packuswb). But since it saturates instead of discarding the bytes you don't want, you have to feed it input with the data you want to keep in the low byte of each 16-bit element.

In your case, that's probably the high byte of each RGB16 component, discarding the 8 least-significant bits of each color component, i.e. _mm_srli_epi16(v, 8). To keep the low byte instead (zeroing the high byte of each 16-bit element), use _mm_and_si128(v, _mm_set1_epi16(0x00ff)). (In that case, never mind all the stuff below about using an unaligned load to replace one of the shifts; that's the easy case and you should just use two ANDs to feed a PACKUS, as sketched right after this paragraph.)
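
For that easy case, a minimal sketch (the helper name is just illustrative) of two ANDs feeding one PACKUS:

#include <immintrin.h>
#include <stdint.h>

// Keep the LOW byte of each of 16 uint16_t elements (32 source bytes -> 16 output bytes).
static inline __m128i pack_low8(const uint16_t *src) {
    const __m128i mask = _mm_set1_epi16(0x00FF);
    __m128i v0 = _mm_and_si128(_mm_loadu_si128((const __m128i*)src),       mask);  // elements 0..7
    __m128i v1 = _mm_and_si128(_mm_loadu_si128((const __m128i*)(src + 8)), mask);  // elements 8..15
    return _mm_packus_epi16(v0, v1);   // values are already <= 255, so no saturation
}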

That's more or less how gcc and clang auto-vectorize this at -O3, except that they both screw it up and waste significant instructions (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82356, https://bugs.llvm.org/show_bug.cgi?id=34773). Still, letting them auto-vectorize with SSE2 (the baseline for x86-64), or with NEON for ARM or whatever, is a safe way to get some performance without the risk of introducing bugs while manually vectorizing. Compiler bugs aside, anything they generate will correctly implement the C semantics of this code, which works for any size and alignment:

#include <stdint.h>
#include <stddef.h>

// gcc and clang both auto-vectorize this sub-optimally with SSE2.
// clang is *really* sub-optimal with AVX2, gcc no worse.
void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t *__restrict__ src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;     // bytes = size of the output buffer
  do{
     *dst++ = *src++ >> 8;            // keep the high 8 bits of each 16-bit element
  } while(dst < end_dst);
}

(See the code + asm output for this and the later versions.)

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Compilers auto-vectorize sort of like this, but with different
// silly missed optimizations.
// This is a sort-of-reasonable SSE2 baseline with no manual unrolling.
void pack_high8(uint8_t *restrict dst, const uint16_t *restrict src, size_t bytes) {
  // TODO: handle non-multiple-of-16 sizes
  uint8_t *end_dst = dst + bytes;
  do{
     __m128i v0 = _mm_loadu_si128((const __m128i*)src);      // elements 0..7
     __m128i v1 = _mm_loadu_si128((const __m128i*)src + 1);  // elements 8..15
     v0 = _mm_srli_epi16(v0, 8);                             // high byte -> low byte of each element
     v1 = _mm_srli_epi16(v1, 8);
     __m128i pack = _mm_packus_epi16(v0, v1);                // 16 packed result bytes
     _mm_storeu_si128((__m128i*)dst, pack);
     dst += 16;
     src += 16;  // 16 uint16_t elements = 32 bytes
  } while(dst < end_dst);
}


But vector shift throughput is limited to 1 per clock on many microarchitectures (Intel before Skylake, AMD Bulldozer/Ryzen). Also, there's no load+shift asm instruction until AVX512, so it's hard to get all these operations through the pipeline; i.e. we easily bottleneck on the front-end.

Instead of shifting, we can load from an address that's offset by one byte, so the bytes we want are already in the right place. ANDing to mask off the bytes we want has good throughput, especially with AVX, where the compiler can fold the load+and into one instruction. If the input is 32-byte aligned and we only do this offset-load trick for one vector out of each pair, our loads never cross a cache-line boundary. With loop unrolling, this is probably the best bet for SSE2 or AVX (without AVX2) on many CPUs.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Take both args as uint8_t* so we can offset one load by 1 byte and replace a shift with an AND.
// The offset load stays inside the current 32B block, so if src is 32B-aligned
// we never have cache-line splits.
void pack_high8_alignhack(uint8_t *restrict dst, const uint8_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     // Loading from src+1 puts the high byte of each 16-bit element into the
     // low byte position, so a cheap AND replaces one of the two shifts.
     __m128i v0_offset = _mm_loadu_si128((const __m128i*)(src + 1));   // bytes 1..16
     __m128i v1        = _mm_loadu_si128((const __m128i*)(src + 16));  // bytes 16..31
     __m128i v0 = _mm_and_si128(v0_offset, _mm_set1_epi16(0x00FF));    // keep bytes 1,3,...,15
     v1 = _mm_srli_epi16(v1, 8);                                       // keep bytes 17,19,...,31
     __m128i pack = _mm_packus_epi16(v0, v1);
     _mm_storeu_si128((__m128i*)dst, pack);  // use _mm_store_si128 if dst is known to be 16B-aligned
     dst += 16;
     src += 32;  // 32 source bytes -> 16 output bytes
  } while(dst < end_dst);
}

Without AVX, the inner loop takes 6 instructions (6 uops) per 16B vector of results. (With AVX it's only 5, since the load folds into the AND.) Since this totally bottlenecks on the front-end, loop unrolling helps a lot. gcc -O3 -funroll-loops looks pretty good for this manually-vectorized version, especially with gcc -O3 -funroll-loops -march=sandybridge to enable AVX.

With AVX, it might be worth doing both v0 and v1 with an AND to reduce the front-end bottleneck, at the cost of cache-line splits (and occasional page splits). But maybe not, depending on the uarch and on whether your data is already misaligned. (Branching on that could be worth it, since you need to max out cache bandwidth if the data is hot in L1D.)
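
A sketch of that "AND both halves" variant (the helper name is illustrative, not from the original answer): both loads are offset by one byte, so a vpand with the load folded in replaces each shift. The second load reads one byte past the current 32-byte group, which is where the occasional cache-line and page splits come from; mind the end of the buffer.

#include <immintrin.h>
#include <stdint.h>

// Pack the high bytes of one 32-byte group using two offset loads + ANDs, no shifts.
static inline __m128i pack_high8_two_ands(const uint8_t *src) {
    const __m128i mask = _mm_set1_epi16(0x00FF);
    __m128i v0 = _mm_and_si128(_mm_loadu_si128((const __m128i*)(src + 1)),  mask);  // bytes 1,3,...,15
    __m128i v1 = _mm_and_si128(_mm_loadu_si128((const __m128i*)(src + 17)), mask);  // bytes 17,19,...,31
    return _mm_packus_epi16(v0, v1);
}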

With AVX2, a 256b version of this with 256b loads should work well on Haswell/Skylake. With src 64B-aligned, the offset load still never cache-line splits (it stays within bytes [32:1] of a cache line, while the plain load covers bytes [63:32]). But packs work within 128b lanes, so after the pack you have to shuffle (with vpermq) to put the 64-bit chunks into the right order. Look at how gcc auto-vectorizes the scalar baseline version with vpackuswb ymm7, ymm5, ymm6 / vpermq ymm8, ymm7, 0xD8.
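
A minimal sketch of that 256b inner step, assuming AVX2 and a multiple-of-32 output size (the helper name and structure are illustrative, not from the original answer):

#include <immintrin.h>
#include <stdint.h>

// Pack the high bytes of 32 uint16_t elements (64 source bytes) into 32 output bytes.
static inline void pack_high8_avx2_block(uint8_t *dst, const uint16_t *src) {
    __m256i v0 = _mm256_loadu_si256((const __m256i*)src);         // elements 0..15
    __m256i v1 = _mm256_loadu_si256((const __m256i*)(src + 16));  // elements 16..31
    v0 = _mm256_srli_epi16(v0, 8);
    v1 = _mm256_srli_epi16(v1, 8);
    __m256i packed = _mm256_packus_epi16(v0, v1);     // in-lane pack: 64-bit chunks in order 0,2,1,3
    packed = _mm256_permute4x64_epi64(packed, 0xD8);  // vpermq: restore order 0,1,2,3
    _mm256_storeu_si256((__m256i*)dst, packed);
}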

With AVX512F, this trick stops working, because a 64B load has to be aligned to stay within a single 64B cache line. But with AVX512 different shuffles are available, and ALU uop throughput is more precious (on Skylake-AVX512, port 1 shuts down while 512b uops are in flight). So v = load + shift -> __m256i packed = _mm512_cvtepi16_epi8(v) might work well, even though it only does 256b stores.
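
A minimal sketch of that idea, assuming AVX512BW (the helper name is illustrative; per the note below, this targets Skylake-AVX512 rather than KNL):

#include <immintrin.h>
#include <stdint.h>

// Pack the high bytes of 32 uint16_t elements (64 source bytes) into 32 output bytes.
static inline void pack_high8_avx512bw_block(uint8_t *dst, const uint16_t *src) {
    __m512i v = _mm512_loadu_si512((const void*)src);   // 32 x uint16_t
    v = _mm512_srli_epi16(v, 8);                         // move each high byte into the low byte
    __m256i packed = _mm512_cvtepi16_epi8(v);            // vpmovwb: truncate each word to a byte
    _mm256_storeu_si256((__m256i*)dst, packed);          // 256b store
}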

The right choice probably depends on whether your src and dst are usually 64B-aligned. KNL doesn't have AVX512BW, so this probably only applies to Skylake-AVX512 anyway.

That concludes this article on quickly copying every second byte to a new memory area; hopefully the answer above is helpful.
