Problem Description
I'm currently writing some code targeting Intel's forthcoming AVX-512 SIMD instructions, which support 512-bit operations.
Now assume there's a matrix represented by 16 SIMD registers, each holding 16 32-bit integers (corresponding to one row). How can I transpose the matrix with purely SIMD instructions?
There are already solutions for transposing 4x4 or 8x8 matrices with SSE and AVX2, respectively, but I couldn't figure out how to extend this to 16x16 with AVX-512.
Any ideas?
Recommended Answer
For two-operand instructions using SIMD you can show that the number of operations necessary to transpose an nxn matrix is n*log_2(n), whereas using scalar operations it's O(n^2). In fact, later I'll show that the number of read and write operations using the scalar registers is 2*n*(n-1). Below is a table showing the number of operations to transpose 4x4, 8x8, 16x16, and 32x32 matrices using SSE, AVX, AVX512, and AVX1024 compared to scalar operations.
n                4(SSE)   8(AVX)   16(AVX512)   32(AVX1024)
SIMD ops              8       24           64          160
SIMD +r/w ops        16       40           96          224
Scalar r/w ops       24      112          480         1984
where SIMD +r/w ops includes the read and write operations (n*log_2(n) + 2*n).
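For example, for n = 16 that is 16*log_2(16) = 64 SIMD operations plus 2*16 = 32 loads and stores, i.e. 96 in total, versus 2*16*15 = 480 scalar read/write operations, as in the table above.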
The reason the SIMD transpose can be done in n*log_2(n) operations is that the algorithm is:
permute n 32-bit rows
permute n 64-bit rows
...
permute n simd_width/2-bit rows
For example, for 4x4 there are 4 rows, and therefore you have to permute the 32-bit lanes 4 times and then the 64-bit lanes 4 times. For 16x16 you have to permute the 32-bit lanes, the 64-bit lanes, the 128-bit lanes, and finally the 256-bit lanes, 16 times each.
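As an illustration of the 4x4 case, here is a minimal, untested sketch with SSE that follows exactly this pattern: four permutes of 32-bit lanes (unpacks) followed by four permutes of 64-bit lanes (movelh/movehl), 8 operations in total. The function name transpose4x4_ps is just a label for this sketch.
#include <xmmintrin.h>
// 4x4 float transpose: stage 1 permutes 32-bit lanes, stage 2 permutes 64-bit lanes
static inline void transpose4x4_ps(__m128 *row0, __m128 *row1,
                                   __m128 *row2, __m128 *row3) {
    __m128 t0 = _mm_unpacklo_ps(*row0, *row1); // a0 b0 a1 b1
    __m128 t1 = _mm_unpackhi_ps(*row0, *row1); // a2 b2 a3 b3
    __m128 t2 = _mm_unpacklo_ps(*row2, *row3); // c0 d0 c1 d1
    __m128 t3 = _mm_unpackhi_ps(*row2, *row3); // c2 d2 c3 d3
    *row0 = _mm_movelh_ps(t0, t2);             // a0 b0 c0 d0
    *row1 = _mm_movehl_ps(t2, t0);             // a1 b1 c1 d1
    *row2 = _mm_movelh_ps(t1, t3);             // a2 b2 c2 d2
    *row3 = _mm_movehl_ps(t3, t1);             // a3 b3 c3 d3
}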
I already showed that 8x8 can be done with 24 operations with AVX. So the question is how to do this for 16x16 using AVX512 in 64 operations? The general algorithm is:
interleave 32-bit lanes using
8x _mm512_unpacklo_epi32
8x _mm512_unpackhi_epi32
interleave 64-bit lanes using
8x _mm512_unpacklo_epi64
8x _mm512_unpackhi_epi64
permute 128-bit lanes using
16x _mm512_shuffle_i32x4
permute 256-bit lanes again using
16x _mm512_shuffle_i32x4
Here is the untested code:
//given __m512i r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, ra, rb, rc, rd, re, rf;
__m512i t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, ta, tb, tc, td, te, tf;
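// stage 1: interleave 32-bit elements of adjacent row pairs (16 unpacks)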
t0 = _mm512_unpacklo_epi32(r0,r1); // 0 16 1 17 4 20 5 21 8 24 9 25 12 28 13 29
t1 = _mm512_unpackhi_epi32(r0,r1); // 2 18 3 19 6 22 7 23 10 26 11 27 14 30 15 31
t2 = _mm512_unpacklo_epi32(r2,r3); // 32 48 33 49 ...
t3 = _mm512_unpackhi_epi32(r2,r3); // 34 50 35 51 ...
t4 = _mm512_unpacklo_epi32(r4,r5); // 64 80 65 81 ...
t5 = _mm512_unpackhi_epi32(r4,r5); // 66 82 67 83 ...
t6 = _mm512_unpacklo_epi32(r6,r7); // 96 112 97 113 ...
t7 = _mm512_unpackhi_epi32(r6,r7); // 98 114 99 115 ...
t8 = _mm512_unpacklo_epi32(r8,r9); // 128 ...
t9 = _mm512_unpackhi_epi32(r8,r9); // 130 ...
ta = _mm512_unpacklo_epi32(ra,rb); // 160 ...
tb = _mm512_unpackhi_epi32(ra,rb); // 162 ...
tc = _mm512_unpacklo_epi32(rc,rd); // 192 ...
td = _mm512_unpackhi_epi32(rc,rd); // 194 ...
te = _mm512_unpacklo_epi32(re,rf); // 224 ...
tf = _mm512_unpackhi_epi32(re,rf); // 226 ...
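// stage 2: interleave 64-bit elements (16 unpacks)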
r0 = _mm512_unpacklo_epi64(t0,t2); // 0 16 32 48 ...
r1 = _mm512_unpackhi_epi64(t0,t2); // 1 17 33 49 ...
r2 = _mm512_unpacklo_epi64(t1,t3); // 2 18 34 50 ...
r3 = _mm512_unpackhi_epi64(t1,t3); // 3 19 35 51 ...
r4 = _mm512_unpacklo_epi64(t4,t6); // 64 80 96 112 ...
r5 = _mm512_unpackhi_epi64(t4,t6); // 65 81 97 113 ...
r6 = _mm512_unpacklo_epi64(t5,t7); // 66 82 98 114 ...
r7 = _mm512_unpackhi_epi64(t5,t7); // 67 83 99 115 ...
r8 = _mm512_unpacklo_epi64(t8,ta); // 128 144 160 176 ...
r9 = _mm512_unpackhi_epi64(t8,ta); // 129 145 161 177 ...
ra = _mm512_unpacklo_epi64(t9,tb); // 130 146 162 178 ...
rb = _mm512_unpackhi_epi64(t9,tb); // 131 147 163 179 ...
rc = _mm512_unpacklo_epi64(tc,te); // 192 208 224 240 ...
rd = _mm512_unpackhi_epi64(tc,te); // 193 209 225 241 ...
re = _mm512_unpacklo_epi64(td,tf); // 194 210 226 242 ...
rf = _mm512_unpackhi_epi64(td,tf); // 195 211 227 243 ...
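// stage 3: permute 128-bit lanes (16x _mm512_shuffle_i32x4)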
t0 = _mm512_shuffle_i32x4(r0, r4, 0x88); // 0 16 32 48 8 24 40 56 64 80 96 112 ...
t1 = _mm512_shuffle_i32x4(r1, r5, 0x88); // 1 17 33 49 ...
t2 = _mm512_shuffle_i32x4(r2, r6, 0x88); // 2 18 34 50 ...
t3 = _mm512_shuffle_i32x4(r3, r7, 0x88); // 3 19 35 51 ...
t4 = _mm512_shuffle_i32x4(r0, r4, 0xdd); // 4 20 36 52 ...
t5 = _mm512_shuffle_i32x4(r1, r5, 0xdd); // 5 21 37 53 ...
t6 = _mm512_shuffle_i32x4(r2, r6, 0xdd); // 6 22 38 54 ...
t7 = _mm512_shuffle_i32x4(r3, r7, 0xdd); // 7 23 39 55 ...
t8 = _mm512_shuffle_i32x4(r8, rc, 0x88); // 128 144 160 176 ...
t9 = _mm512_shuffle_i32x4(r9, rd, 0x88); // 129 145 161 177 ...
ta = _mm512_shuffle_i32x4(ra, re, 0x88); // 130 146 162 178 ...
tb = _mm512_shuffle_i32x4(rb, rf, 0x88); // 131 147 163 179 ...
tc = _mm512_shuffle_i32x4(r8, rc, 0xdd); // 132 148 164 180 ...
td = _mm512_shuffle_i32x4(r9, rd, 0xdd); // 133 149 165 181 ...
te = _mm512_shuffle_i32x4(ra, re, 0xdd); // 134 150 166 182 ...
tf = _mm512_shuffle_i32x4(rb, rf, 0xdd); // 135 151 167 183 ...
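// stage 4: permute 256-bit lanes (16x _mm512_shuffle_i32x4)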
r0 = _mm512_shuffle_i32x4(t0, t8, 0x88); // 0 16 32 48 64 80 96 112 ... 240
r1 = _mm512_shuffle_i32x4(t1, t9, 0x88); // 1 17 33 49 65 81 97 113 ... 241
r2 = _mm512_shuffle_i32x4(t2, ta, 0x88); // 2 18 34 50 66 82 98 114 ... 242
r3 = _mm512_shuffle_i32x4(t3, tb, 0x88); // 3 19 35 51 67 83 99 115 ... 243
r4 = _mm512_shuffle_i32x4(t4, tc, 0x88); // 4 ...
r5 = _mm512_shuffle_i32x4(t5, td, 0x88); // 5 ...
r6 = _mm512_shuffle_i32x4(t6, te, 0x88); // 6 ...
r7 = _mm512_shuffle_i32x4(t7, tf, 0x88); // 7 ...
r8 = _mm512_shuffle_i32x4(t0, t8, 0xdd); // 8 ...
r9 = _mm512_shuffle_i32x4(t1, t9, 0xdd); // 9 ...
ra = _mm512_shuffle_i32x4(t2, ta, 0xdd); // 10 ...
rb = _mm512_shuffle_i32x4(t3, tb, 0xdd); // 11 ...
rc = _mm512_shuffle_i32x4(t4, tc, 0xdd); // 12 ...
rd = _mm512_shuffle_i32x4(t5, td, 0xdd); // 13 ...
re = _mm512_shuffle_i32x4(t6, te, 0xdd); // 14 ...
rf = _mm512_shuffle_i32x4(t7, tf, 0xdd); // 15 31 47 63 79 95 111 127 ... 255
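Counting the operations above: 16 unpacks at 32-bit granularity, 16 unpacks at 64-bit granularity, 16 shuffles of 128-bit lanes, and 16 shuffles of 256-bit lanes, i.e. 64 operations in total, matching n*log_2(n) = 16*4 from the table.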
I got the idea for using _mm512_shuffle_i32x4 by looking at transposing a 4x4 matrix using _mm_shuffle_ps (which is what MSVC uses in _MM_TRANSPOSE4_PS, but not GCC and ICC).
__m128 tmp0, tmp1, tmp2, tmp3;
tmp0 = _mm_shuffle_ps(row0, row1, 0x88); // 0 2 4 6
tmp1 = _mm_shuffle_ps(row0, row1, 0xdd); // 1 3 5 7
tmp2 = _mm_shuffle_ps(row2, row3, 0x88); // 8 a c e
tmp3 = _mm_shuffle_ps(row2, row3, 0xdd); // 9 b d f
row0 = _mm_shuffle_ps(tmp0, tmp2, 0x88); // 0 4 8 c
row1 = _mm_shuffle_ps(tmp1, tmp3, 0x88); // 1 5 9 d
row2 = _mm_shuffle_ps(tmp0, tmp2, 0xdd); // 2 6 a e
row3 = _mm_shuffle_ps(tmp1, tmp3, 0xdd); // 3 7 b f
The same idea applies to _mm512_shuffle_i32x4, but now the lanes are 128-bit instead of 32-bit and there are 16 rows instead of 4.
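In both cases the immediates pick the even and odd positions: 0x88 is _MM_SHUFFLE(2,0,2,0), which takes elements 0 and 2 from each source (for _mm512_shuffle_i32x4 the "elements" are 128-bit lanes), while 0xdd is _MM_SHUFFLE(3,1,3,1), which takes elements 1 and 3 from each source.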
Finally, to compare to scalar operations, I modified Example 9.5a from Agner Fog's optimizing C++ manual:
#define SIZE 16
void transpose(int a[SIZE][SIZE]) { // function to transpose matrix
    // define a macro to swap two array elements:
    #define swapd(x,y) {temp=x; x=y; y=temp;}
    int r, c; int temp;
    for (r = 1; r < SIZE; r++) {
        for (c = 0; c < r; c++) {
            swapd(a[r][c], a[c][r]);
        }
    }
}
This does n*(n-1)/2 swaps (because the diagonal does not need to be swapped). The swaps from the assembly for 16x16 look like:
mov r8d, DWORD PTR [rax+68]
mov r9d, DWORD PTR [rdx+68]
mov DWORD PTR [rax+68], r9d
mov DWORD PTR [rdx+68], r8d
So the number of read/write operations using the scalar registers is 2*n*(n-1): each swap is two loads and two stores, and there are n*(n-1)/2 swaps, which for n = 16 gives 480, as in the table above.