This post looks at how to get an AVX2 equivalent of SSE's _mm_alignr_epi8 (PALIGNR).

Question

In SSSE3, the PALIGNR instruction concatenates its two 128-bit operands into a 32-byte intermediate value and extracts a 16-byte result shifted right by a constant byte offset.

I'm currently in the midst of porting my SSE4 code to use AVX2 instructions, working on 256-bit registers instead of 128-bit ones. Naively, I believed that the intrinsic _mm256_alignr_epi8 (VPALIGNR) performs the same operation as _mm_alignr_epi8, only on 256-bit registers. Sadly, that is not the case. In fact, _mm256_alignr_epi8 treats the 256-bit register as two 128-bit registers and performs two "align" operations on the neighboring 128-bit lanes, effectively performing the same operation as _mm_alignr_epi8 but on two registers at once. It's illustrated most clearly here: _mm256_alignr_epi8
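The per-lane behavior can be modeled in plain scalar C. The sketch below (helper names are made up for illustration; byte 0 is the least-significant byte, as in Intel's diagrams) shows that _mm256_alignr_epi8 is two independent 128-bit aligns, not one 256-bit shift:

```c
#include <string.h>

/* Model of _mm_alignr_epi8(a, b, n): bytes n..n+15 of the 32-byte
 * concatenation [b, a] (b supplies the low 16 bytes). */
static void alignr128(unsigned char *dst, const unsigned char *a,
                      const unsigned char *b, int n)
{
    unsigned char concat[32];
    memcpy(concat, b, 16);       /* b: low 16 bytes  */
    memcpy(concat + 16, a, 16);  /* a: high 16 bytes */
    memcpy(dst, concat + n, 16);
}

/* Model of _mm256_alignr_epi8(a, b, n): the SAME 128-bit operation applied
 * independently to the low and high lanes -- NOT one 64-byte shift. */
static void alignr256_lanewise(unsigned char *dst, const unsigned char *a,
                               const unsigned char *b, int n)
{
    alignr128(dst, a, b, n);                /* low lane  */
    alignr128(dst + 16, a + 16, b + 16, n); /* high lane */
}
```

With b holding bytes 0..31 and a holding bytes 32..63, a shift of 1 puts a[0] (not b[16]) at result byte 15, which is exactly the lane-split surprise described above.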

Currently my solution is to keep using _mm_alignr_epi8 by splitting the ymm (256-bit) registers into two xmm (128-bit) registers (high and low), like so:

__m128i xmm_ymm1_lo = _mm256_extractf128_si256(ymm1, 0);  // low 128 bits of ymm1
__m128i xmm_ymm1_hi = _mm256_extractf128_si256(ymm1, 1);  // high 128 bits of ymm1
__m128i xmm_ymm2_lo = _mm256_extractf128_si256(ymm2, 0);  // low 128 bits of ymm2
__m128i xmm_ymm_aligned_lo = _mm_alignr_epi8(xmm_ymm1_hi, xmm_ymm1_lo, 1);
__m128i xmm_ymm_aligned_hi = _mm_alignr_epi8(xmm_ymm2_lo, xmm_ymm1_hi, 1);
// _mm256_set_m128i takes the high half first.
__m256i ymm_aligned = _mm256_set_m128i(xmm_ymm_aligned_hi, xmm_ymm_aligned_lo);

This works, but there has to be a better way, right? Is there perhaps a more "general" AVX2 instruction that I should be using to get the same result?

Answer

What are you using palignr for? If it's only to handle data misalignment, simply use misaligned loads instead; they are generally "fast enough" on modern Intel µ-architectures (and will save you a lot of code size).

If you need palignr-like behavior for some other reason, you can simply take advantage of the unaligned load support to do it in a branch-free manner. Unless you're totally load-store bound, this is probably the preferred idiom.

#include <immintrin.h>
#include <stdalign.h>

// Renamed from _mm256_alignr_epi8 so it doesn't collide with the real intrinsic.
static inline __m256i alignr_256(const __m256i v0, const __m256i v1, const int n)
{
    // 64-byte alignment guarantees the misaligned load below
    // cannot cross a page boundary.
    alignas(64) char buffer[64];

    // Two aligned stores to fill the buffer.
    _mm256_store_si256((__m256i *)&buffer[0], v0);
    _mm256_store_si256((__m256i *)&buffer[32], v1);

    // Misaligned load to get the data we want.
    return _mm256_loadu_si256((__m256i *)&buffer[n]);
}
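As an aside on the "more general AVX2 instruction" question: a common register-only idiom (this is an editorial addition, not part of the quoted answer, so verify it against the Intel Intrinsics Guide) builds the cross-lane operand with `t = _mm256_permute2x128_si256(v0, v1, 0x21)` (giving [v0.high : v1.low]) and then issues a single lane-wise `_mm256_alignr_epi8(t, v0, n)` for shifts of 1..15 bytes. The scalar sketch below (hypothetical helper names) checks only the byte math of that composition:

```c
#include <string.h>

/* Desired semantics: one 256-bit alignr, i.e. bytes n..n+31 of the 64-byte
 * concatenation [b, a], where b is the low source (byte 0 = least significant). */
static void full_alignr256(unsigned char *dst, const unsigned char *a,
                           const unsigned char *b, int n)
{
    unsigned char concat[64];
    memcpy(concat, b, 32);
    memcpy(concat + 32, a, 32);
    memcpy(dst, concat + n, 32);
}

/* Scalar model of _mm256_permute2x128_si256(b, a, 0x21) followed by the
 * lane-wise _mm256_alignr_epi8(t, b, n). */
static void permute_then_alignr(unsigned char *dst, const unsigned char *a,
                                const unsigned char *b, int n)
{
    unsigned char t[32], cat[32];
    memcpy(t, b + 16, 16);  /* t low lane  = high lane of b (selector 1) */
    memcpy(t + 16, a, 16);  /* t high lane = low lane of a  (selector 2) */
    for (int lane = 0; lane < 2; lane++) {
        memcpy(cat, b + 16 * lane, 16);       /* b lane: low 16 bytes  */
        memcpy(cat + 16, t + 16 * lane, 16);  /* t lane: high 16 bytes */
        memcpy(dst + 16 * lane, cat + n, 16);
    }
}
```

For n in 0..15 the two routines agree byte-for-byte, which is why the permute-plus-alignr pair reproduces a full 256-bit alignr in two instructions, at the cost of the cross-lane permute's extra latency.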

If you can provide more information about how exactly you're using palignr, I can probably be more helpful.
