问题描述
我有一个需要转置的矩阵(相对较大).例如假设我的矩阵是
I have a matrix (relatively big) that I need to transpose. For example assume that my matrix is
a b c d e f
g h i j k l
m n o p q r
我希望结果如下:
a g m
b h n
c I o
d j p
e k q
f l r
最快的方法是什么?
推荐答案
这是个好问题.您想要在内存中实际转置矩阵而不仅仅是交换坐标的原因有很多,例如在矩阵乘法和高斯拖尾中.
This is a good question. There are many reason you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.
首先让我列出我用于转置的一个函数(请参阅我的答案的结尾,我找到了一个更快的解决方案)
First let me list one of the functions I use for the transpose ( please see the end of my answer where I found a much faster solution)
void transpose(float *src, float *dst, const int N, const int M) {
#pragma omp parallel for
for(int n = 0; n<N*M; n++) {
int i = n/N;
int j = n%N;
dst[n] = src[M*j + i];
}
}
现在让我们看看为什么转置很有用.考虑矩阵乘法 C = A*B.我们可以这样做.
Now let's see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.
for(int i=0; i<N; i++) {
for(int j=0; j<K; j++) {
float tmp = 0;
for(int l=0; l<M; l++) {
tmp += A[M*i+l]*B[K*l+j];
}
C[K*i + j] = tmp;
}
}
然而,那样的话,将会有很多缓存未命中.一个更快的解决方案是先对 B 进行转置
That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first
transpose(B);
for(int i=0; i<N; i++) {
for(int j=0; j<K; j++) {
float tmp = 0;
for(int l=0; l<M; l++) {
tmp += A[M*i+l]*B[K*j+l];
}
C[K*i + j] = tmp;
}
}
transpose(B);
矩阵乘法是O(n^3),转置是O(n^2),所以转置对计算时间的影响应该可以忽略不计(对于大n
).在矩阵乘法循环中,平铺甚至比转置更有效,但要复杂得多.
Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n
). In matrix multiplication loop tiling is even more effective than taking the transpose but that's much more complicated.
我希望我知道一种更快的转置方法(我找到了一个更快的解决方案,请参阅我的答案结尾).当 Haswell/AVX2 几周后出来时,它将具有聚集功能.我不知道这在这种情况下是否会有帮助,但我可以想象收集一列并写出一行.也许它会使转置变得不必要.
I wish I knew a faster way to do the transpose ( I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case but I could image gathering a column and writing out a row. Maybe it will make the transpose unnecessary.
对于高斯涂抹,您所做的是水平涂抹然后垂直涂抹.但是垂直涂抹有缓存问题所以你要做的是
For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem so what you do is
Smear image horizontally
transpose output
Smear output horizontally
transpose output
Here is a paper by Intel explaining thathttp://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions
最后,我在矩阵乘法(和高斯拖尾)中实际做的不是完全采用转置,而是采用特定矢量大小(例如,SSE/AVX 为 4 或 8)的宽度的转置.这是我使用的功能
Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use
void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
#pragma omp parallel for
for(int n=0; n<M*N; n++) {
int k = vec_size*(n/N/vec_size);
int i = (n/vec_size)%N;
int j = n%vec_size;
B[n] = A[M*i + k + j];
}
}
我尝试了几个函数来为大矩阵找到最快的转置.最后,最快的结果是使用带有 block_size=16
的循环阻塞(我找到了一个使用 SSE 和循环阻塞的更快解决方案 - 见下文).此代码适用于任何 NxM 矩阵(即矩阵不必是正方形).
I tried several function to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16
( I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).
inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
#pragma omp parallel for
for(int i=0; i<block_size; i++) {
for(int j=0; j<block_size; j++) {
B[j*ldb + i] = A[i*lda +j];
}
}
}
inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
#pragma omp parallel for
for(int i=0; i<n; i+=block_size) {
for(int j=0; j<m; j+=block_size) {
transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size);
}
}
}
值 lda
和 ldb
是矩阵的宽度.这些需要是块大小的倍数.查找值并为例如分配内存一个 3000x1001 的矩阵我做这样的事情
The values lda
and ldb
are the width of the matrix. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like this
#define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
const int n = 3000;
const int m = 1001;
int lda = ROUND_UP(m, 16);
int ldb = ROUND_UP(n, 16);
float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
对于 3000x1001,返回 ldb = 3008
和 lda = 1008
For 3000x1001 this returns ldb = 3008
and lda = 1008
我找到了一个使用 SSE 内在函数的更快的解决方案:
I found an even faster solution using SSE intrinsics:
inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
__m128 row1 = _mm_load_ps(&A[0*lda]);
__m128 row2 = _mm_load_ps(&A[1*lda]);
__m128 row3 = _mm_load_ps(&A[2*lda]);
__m128 row4 = _mm_load_ps(&A[3*lda]);
_MM_TRANSPOSE4_PS(row1, row2, row3, row4);
_mm_store_ps(&B[0*ldb], row1);
_mm_store_ps(&B[1*ldb], row2);
_mm_store_ps(&B[2*ldb], row3);
_mm_store_ps(&B[3*ldb], row4);
}
inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb ,const int block_size) {
#pragma omp parallel for
for(int i=0; i<n; i+=block_size) {
for(int j=0; j<m; j+=block_size) {
int max_i2 = i+block_size < n ? i + block_size : n;
int max_j2 = j+block_size < m ? j + block_size : m;
for(int i2=i; i2<max_i2; i2+=4) {
for(int j2=j; j2<max_j2; j2+=4) {
transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
}
}
}
}
}
这篇关于在 C++ 中转置矩阵的最快方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!