What is the best practice for swapping __m128i variables?
The background is a compile error under Sun Studio 12.2, which is a C++03 compiler. __m128i is an opaque type used with MMX and SSE instructions, and it's usually an unsigned long long[2]. C++03 does not support swapping arrays, and std::swap(__m128i a, __m128i b) fails under the compiler.
Here are some related questions that don't quite hit the mark. They don't apply because std::vector is not available.
- How can we swap 2 arrays in constant complexity or O(1)? (http://stackoverflow.com/q/26312551)
- Is it possible to swap arrays of structs
- C++03 moving a vector into a class member through constructor
This doesn't sound like a best-practices issue; it sounds like you need a workaround for a seriously broken implementation of intrinsics. If __m128i tmp = a; doesn't compile, that's pretty bad.
If you're going to write a custom swap function, keep it simple. __m128i is a POD type that fits in a single vector register. Don't do anything that will encourage the compiler to spill it to memory. Some compilers will generate really horrible code even for a trivial test-case, and even gcc/clang might trip over a memcpy as part of optimizing a big complicated function.
Since the compiler is choking on the constructor, just declare a plain tmp variable and use = assignment to do the copying. That always works efficiently in any compiler that supports __m128i, and is a common pattern.
Plain assignment to/from values in memory works like _mm_store_si128 / _mm_load_si128: i.e. movdqa aligned stores/loads that will fault if used on unaligned addresses. (Of course, optimization can result in loads getting folded into memory operands to another vector instruction, or stores not happening at all.)
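If the data might not be 16-byte aligned, plain assignment through a __m128i pointer is not safe, and the unaligned load/store intrinsics are the usual alternative. The following is a minimal sketch of that idea; the helper name swap16_unaligned is hypothetical and not part of the original answer.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2: __m128i, _mm_loadu_si128, _mm_storeu_si128

// Hypothetical helper (an assumption for illustration): swap two 16-byte
// blocks that are not guaranteed to be 16-byte aligned. Dereferencing a
// __m128i* here would imply aligned accesses, which can fault.
inline void swap16_unaligned(void* p, void* q) {
    assert(p && q);  // both pointers must refer to 16 readable/writable bytes
    __m128i a = _mm_loadu_si128(static_cast<const __m128i*>(p));
    __m128i b = _mm_loadu_si128(static_cast<const __m128i*>(q));
    _mm_storeu_si128(static_cast<__m128i*>(p), b);
    _mm_storeu_si128(static_cast<__m128i*>(q), a);
}
```

Both values are loaded into registers before either store, so the two blocks may even overlap without one store clobbering data that has not been read yet.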
// alternate names: assignment_swap
// or swap128, but then the name doesn't fit for __m256i...
// __m128i t(a) errors, so just use simple initializers / assignment
template<class T>
void vecswap(T& a, T& b) {
    // T t = a;  // Apparently SunCC even choked on this
    T t;
    t = a;
    a = b;
    b = t;
}
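As a quick sanity check of the assignment-based swap, something like the following could be used. It repeats vecswap so that it compiles on its own; same_bits and vecswap_demo are hypothetical names introduced only for this sketch, and it assumes an x86 target with SSE2.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

template<class T>
void vecswap(T& a, T& b) {
    T t;
    t = a;
    a = b;
    b = t;
}

// Hypothetical helper: true if all 128 bits of a and b are equal.
inline bool same_bits(__m128i a, __m128i b) {
    // cmpeq sets each byte to 0xFF where equal; movemask collects 16 bits.
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}

// Hypothetical demo: swap two distinct vectors and verify the result.
inline bool vecswap_demo() {
    __m128i x = _mm_set1_epi32(1);
    __m128i y = _mm_set1_epi32(2);
    const __m128i old_x = x;
    const __m128i old_y = y;
    vecswap(x, y);
    const bool ok = same_bits(x, old_y) && same_bits(y, old_x);
    assert(ok);
    return ok;
}
```

Nothing in this check takes the address of x or y, so an optimizing compiler is free to keep both in registers throughout.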
Test cases: this produces optimal code even with a crusty compiler like ICC13, which does a terrible job with the memcpy version. The asm output is from the Godbolt compiler explorer, with icc13 -O3:
__m128i test_return2nd(__m128i x, __m128i y) {
    vecswap(x, y);
    return x;
}
    movdqa  xmm0, xmm1
    ret               # returning the 2nd arg, which was in xmm1

__m128i test_return1st(__m128i x, __m128i y) {
    vecswap(x, y);
    return y;
}
    ret               # returning the first arg, already in xmm0
With a memcpy-based memswap instead of vecswap, you get something like:
return1st_memcpy(__m128i, __m128i): ## ICC13 -O3
movdqa XMMWORD PTR [-56+rsp], xmm0
movdqa XMMWORD PTR [-40+rsp], xmm1 # spill both
movaps xmm2, XMMWORD PTR [-56+rsp] # reload x
movaps XMMWORD PTR [-24+rsp], xmm2 # copy x to tmp
movaps xmm0, XMMWORD PTR [-40+rsp] # reload y
movaps XMMWORD PTR [-56+rsp], xmm0 # copy y to x
movaps xmm0, XMMWORD PTR [-24+rsp] # reload tmp
movaps XMMWORD PTR [-40+rsp], xmm0 # copy tmp to y
movdqa xmm0, XMMWORD PTR [-40+rsp] # reload y
ret # return y
This is pretty much the absolute maximum amount of spilling/reloading you could imagine to swap two registers, because icc13 doesn't optimize between the inlined memcpys at all, or even remember what is left in a register.
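The definition of memswap is not shown in this answer. A memcpy-based swap of roughly the following shape is an assumption about what was being compiled, given here only so the comparison can be reproduced; the assert is an addition for safety and not part of any measured code.

```cpp
#include <cassert>
#include <cstring>      // std::memcpy
#include <emmintrin.h>  // SSE2: __m128i

// Assumed shape of the memcpy-based swap (hypothetical reconstruction).
// Taking the addresses of a, b and t is what invites weaker compilers
// to spill the vectors to the stack.
template<class T>
void memswap(T& a, T& b) {
    assert(&a != &b);  // memcpy requires non-overlapping source and destination
    T t;
    std::memcpy(&t, &a, sizeof(T));
    std::memcpy(&a, &b, sizeof(T));
    std::memcpy(&b, &t, sizeof(T));
}

// Same test as test_return1st, but using the memcpy-based swap.
__m128i return1st_memcpy(__m128i x, __m128i y) {
    memswap(x, y);
    return y;  // after the swap, y holds the original first argument
}
```

The result is the same as with vecswap; only the generated code differs, and how much it differs depends on the compiler and optimization level.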
Swapping values already in memory
Even gcc makes worse code with the memcpy version. It does the copy with 64-bit integer loads/stores instead of a 128-bit vector load/store. This is terrible if you're about to load the vector afterwards (store-forwarding stall, because a 128-bit load can't be forwarded from two separate 64-bit stores), and otherwise is just bad (more uops to do the same work).
// the memcpy version of this compiles badly
void test_mem(__m128i *x, __m128i *y) {
    vecswap(*x, *y);
}
# gcc 5.3 and ICC13 make the same code here, since it's easy to optimize
movdqa xmm0, XMMWORD PTR [rdi]
movdqa xmm1, XMMWORD PTR [rsi]
movaps XMMWORD PTR [rdi], xmm1
movaps XMMWORD PTR [rsi], xmm0
ret
// gcc 5.3 with memswap instead of vecswap. ICC13 is similar
test_mem_memcpy(long long __vector(2)*, long long __vector(2)*):
mov rax, QWORD PTR [rdi]
mov rdx, QWORD PTR [rdi+8]
mov r9, QWORD PTR [rsi]
mov r10, QWORD PTR [rsi+8]
mov QWORD PTR [rdi], r9
mov QWORD PTR [rdi+8], r10
mov QWORD PTR [rsi], rax
mov QWORD PTR [rsi+8], rdx
ret