How to emulate _mm256_loadu_epi32 with gcc or clang?

Question

Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32:

__m256i _mm256_loadu_epi32 (void const* mem_addr);
/*
   Instruction: vmovdqu32 ymm, m256
   CPUID Flags: AVX512VL + AVX512F
   Description
       Load 256-bits (composed of 8 packed 32-bit integers) from memory into dst.
       mem_addr does not need to be aligned on any particular boundary.
   Operation
       dst[255:0] := MEM[mem_addr+255:mem_addr]
       dst[MAX:256] := 0
*/

But clang and gcc do not provide this intrinsic. Instead they provide (in file avx512vlintrin.h) only the masked versions

__m256i _mm256_mask_loadu_epi32 (__m256i, __mmask8, void const *);
__m256i _mm256_maskz_loadu_epi32 (__mmask8, void const *);

which boil down to the same instruction vmovdqu32. My question: how can I emulate _mm256_loadu_epi32:

 inline __m256i _mm256_loadu_epi32(void const* mem_addr)
 {
      /* code that uses vmovdqu32 and compiles with gcc */
 }

without writing assembly, i.e. using only the intrinsics that are available?

Answer

Just use _mm256_loadu_si256 like a normal person. The only thing the AVX512 intrinsic gives you is a nicer prototype (const void* instead of const __m256i*).

@chtz points out that you might still want to write a wrapper function yourself to get the void* prototype. But don't call it _mm256_loadu_epi32; some future GCC version will probably add that name for compatibility with Intel's docs and break your code.
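A minimal sketch of such a wrapper, assuming GCC or clang on x86. The names load256_unaligned and sum8_i32 are hypothetical and chosen only for illustration, and the target("avx") attribute is an assumption that lets the snippet build without -mavx on the command line; drop it if you already compile with -mavx or a suitable -march.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical name: deliberately NOT _mm256_loadu_epi32, so a compiler
   header that later defines the Intel name cannot clash with it. */
__attribute__((target("avx")))
static inline __m256i load256_unaligned(const void *mem_addr)
{
    /* Plain AVX unaligned load; the cast only satisfies the prototype. */
    return _mm256_loadu_si256((const __m256i *)mem_addr);
}

/* Demo helper (also hypothetical): sums the eight 32-bit integers at p. */
__attribute__((target("avx")))
int32_t sum8_i32(const void *p)
{
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, load256_unaligned(p));

    int32_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += lanes[i];
    return sum;
}
```

Callers can pass any pointer type without a cast, e.g. sum8_i32(buf + 1) for an int32_t buffer, which is the only convenience the AVX512-style prototype would have provided.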


You don't even want the compiler to emit vmovdqu32 ymm when you're not masking; vmovdqu ymm is shorter and does exactly the same thing, with no penalty for mixing with EVEX-encoded instructions. The compiler can always use a vmovdqu32 or vmovdqu64 if it needs to load into ymm16..31; otherwise you want it to use the shorter VEX-coded AVX1 vmovdqu.

I'm pretty sure that GCC treats _mm256_maskz_loadu_epi32(0xffu, ptr) exactly the same as _mm256_loadu_si256((const __m256i*)ptr) and makes the same asm regardless of which one you use. It can optimize away the 0xffu mask and simply use an unmasked load, but there's no need for that extra complication in your source.
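The following sketch checks only that the two forms load the same bytes at run time; it says nothing about which instruction the compiler picks. It assumes GCC or clang on x86 and uses the __builtin_cpu_supports builtin to skip the check on CPUs without AVX512VL. The function names are hypothetical.

```c
#include <immintrin.h>
#include <string.h>

/* Needs AVX512F + AVX512VL at run time; call it only after a CPU check. */
__attribute__((target("avx512f,avx512vl")))
static int loads_match_avx512vl(const void *p)
{
    /* All-ones mask: every one of the eight 32-bit lanes is loaded. */
    __m256i masked = _mm256_maskz_loadu_epi32((__mmask8)0xff, p);
    __m256i plain  = _mm256_loadu_si256((const __m256i *)p);

    unsigned char a[32], b[32];
    _mm256_storeu_si256((__m256i *)a, masked);
    _mm256_storeu_si256((__m256i *)b, plain);
    return memcmp(a, b, sizeof a) == 0;
}

/* Returns 1 if both loads gave the same bytes, 0 if they differed,
   and -1 if the CPU lacks AVX512F/AVX512VL and the check was skipped. */
int compare_masked_and_plain_load(const void *p)
{
    __builtin_cpu_init();
    if (!__builtin_cpu_supports("avx512f") ||
        !__builtin_cpu_supports("avx512vl"))
        return -1;
    return loads_match_avx512vl(p);
}
```

To see which instruction is actually emitted for each form, inspect the compiler's assembly output (for example with -S) rather than relying on this run-time check.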

But unfortunately current GCC will pessimize to vmovdqu32 ymm0, [mem] when AVX512VL is enabled (e.g. -march=skylake-avx512), even when you write _mm256_loadu_si256. This is a missed optimization, GCC bug 89346.

It doesn't matter which 256-bit load intrinsic you use (except for aligned vs. unaligned) as long as there's no masking.
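As a rough illustration, the sketch below compares the integer load with a float load reinterpreted as integers, and confirms that both deliver the same 32 bytes. This stretches the point slightly beyond the integer intrinsics discussed above and concerns bit content only, not code generation. It assumes GCC or clang on x86; the function name is hypothetical, and target("avx") is used so that no -mavx flag is needed.

```c
#include <immintrin.h>
#include <string.h>

/* Returns 1 if an integer load and a float load (reinterpreted as integers)
   of the same 32 bytes produce identical bit patterns, 0 otherwise. */
__attribute__((target("avx")))
int float_and_int_loads_match(const void *p)
{
    __m256i via_int   = _mm256_loadu_si256((const __m256i *)p);
    /* _mm256_castps_si256 only reinterprets the register; it converts nothing. */
    __m256i via_float = _mm256_castps_si256(_mm256_loadu_ps((const float *)p));

    unsigned char a[32], b[32];
    _mm256_storeu_si256((__m256i *)a, via_int);
    _mm256_storeu_si256((__m256i *)b, via_float);
    return memcmp(a, b, sizeof a) == 0;
}
```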

