问题描述
您可以找到很多好 ,用于转置随SIMD指令集的 natural 大小而变的矩阵,特别是在一行的大小不超过矢量宽度的情况下.例如,在SSE中进行4x4 float
转置,或在AVX/AVX2中进行4x4 double
或8x8 float
转置(对于AVX-512).
You can find a lot of good answers for transposing a matrix which falls with the natural size of the SIMD instruction set, in particular, where the size of one row is no more than the vector width. Examples would be a 4x4 float
transpose in SSE, or a 4x4 double
or 8x8 float
transpose in AVX/AVX2 (double everything again for AVX-512).
但是,当矩阵大于矩阵时,有哪些选择呢?例如,使用AVX2的16x16 浮点矩阵?可以完全使用SIMD改组来加快处理速度,还是仅采用聚集+顺序写入方式?
However, what are the options when the matrix is larger than that? E.g., a 16x16 float
matrix using AVX2? Can SIMD shuffles be used at all to speed things up, or is a gather + sequential write the only way?
推荐答案
如果所有矩阵尺寸都是数据包大小的倍数,则可以按块进行操作并根据需要交换块.使用SSE2的4x4双矩阵示例:
If all your matrix dimensions are a multiple of your packet-size you can do the operation block-wise and swap the blocks as needed. Example for 4x4 double matrix using SSE2:
// transpose vectors i0 and i1 and store the result to addresses r0 and r1
void transpose2x2(double *r0, double* r1, __m128d i0, __m128d i1)
{
__m128d t0 = _mm_unpacklo_pd(i0,i1);
__m128d t1 = _mm_unpackhi_pd(i0,i1);
_mm_storeu_pd(r0, t0);
_mm_storeu_pd(r1, t1);
}
void transpose(double mat[4][4])
{
// transpose [00]-block in-place
transpose2x2(mat[0]+0, mat[1]+0,_mm_loadu_pd(mat[0]+0),_mm_loadu_pd(mat[1]+0));
// load [20]-block
__m128d t20 = _mm_loadu_pd(mat[2]+0), t30 = _mm_loadu_pd(mat[3]+0);
// transpose [02]-block and store it to [20] position
transpose2x2(mat[2]+0,mat[3]+0, _mm_loadu_pd(mat[0]+2),_mm_loadu_pd(mat[1]+2));
// transpose temp-block and store it to [02] position
transpose2x2(mat[0]+2,mat[1]+2, t20, t30);
// transpose [22]-block in-place
transpose2x2(mat[2]+2, mat[3]+2,_mm_loadu_pd(mat[2]+2),_mm_loadu_pd(mat[3]+2));
}
这应该相对容易地扩展到其他平方矩阵,其他标量类型和其他体系结构.不是数据包大小倍数的矩阵可能会更复杂(如果它们足够大,那么进行矢量化的大部分工作,而只是手动处理最后几行/列,可能是值得的).
This should be relatively easy to extend to other square matrices, other scalar types and other architectures. Matrices which are not a multiple of packet sizes are perhaps more complicated (if they are large enough, it will probably be worth it to do most the work with vectorization and just do the last rows/columns manually).
对于某些尺寸,例如3x4或3x8矩阵有特殊的算法[1]-如果您有1003x1003矩阵,则可以在最后的行/列中利用它(并且可能还有其他奇数大小的算法).
For some sizes, e.g. 3x4 or 3x8 matrices there are special algorithms [1] -- if you have a 1003x1003 matrix, you could exploit that for the last rows/columns (and there are probably algorithms for other odd sizes as well).
您还可以花一些力气为矩形矩阵编写代码(必须考虑如何避免一次只缓存一个以上的块,但这是可能的).
With some effort you could also write this for rectangular matrices (some thoughts have to be made how to avoid having to cache more than one block at a time, but it is possible).
Godbolt演示: https://godbolt.org/z/tVk_Bc
Godbolt demo: https://godbolt.org/z/tVk_Bc
这篇关于行大小大于向量宽度时,SIMD转置的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!