I'm just getting started with CUDA and the MAGMA library. I'm trying out some functions on a test problem, a 2D heat equation. The code I wrote seems to work fine for grid sizes of 32, 64, and 128. But for grid sizes of 256 or larger it produces the wrong results. I've only posted part of the code here, enough to reproduce the error. Transferring the final matrix and looking at it in MATLAB shows that the second call to magmablas_dgemm introduces the errors into the solution.
Can anyone see why this code breaks for larger grid sizes?
int main(int argc, char* argv[])
{
    // Get parameters for problem set up
    int side_width = atoi(argv[1]); //assuming square grid, N/32 integer
    double dx = 2.0 / (side_width-1);
    double dt = 0.25 * dx;
    //double Tend = dt*3;// 0.5;

    // create memory pointers for derivative operator matrices and solution matrix
    double* U;
    double* Dleft;
    double* Dright;
    double* dev_U;
    double* dev_Dleft;
    double* dev_Dright;

    //initialize the MAGMA system
    magma_init();
    magma_int_t N = side_width;

    // temp variables required by MAGMA functions
    magma_int_t *piv, info, err;
    piv = (magma_int_t*)malloc(N*sizeof(magma_int_t));

    // Allocate memory for matrices on host and device
    err  = magma_dmalloc_cpu(&U, N*N);
    err += magma_dmalloc_cpu(&Dleft, N*N);
    err += magma_dmalloc_cpu(&Dright, N*N);
    err += magma_dmalloc(&dev_U, N*N);
    err += magma_dmalloc(&dev_Dleft, N*N);
    err += magma_dmalloc(&dev_Dright, N*N);
    if (err){
        printf("error in allocation. err number = %d\n", err);
        exit(1);
    }

    // zero out matrices (not efficient but correct)
    for (int k=0; k<N*N; ++k ){
        U[k] = 1.0;
        Dleft[k] = 0.0;
        Dright[k] = 0.0;
    }

    //create derivative operator matrices
    double a = dt/2.0/dx/dx;
    double b = dt/dx/dx;
    Dleft[0] = 1.0;
    Dleft[N*N-1] = 1.0;
    for (int k=1; k<N-1; ++k) {
        Dleft[k*N + k-1] = -a;
        Dleft[k*N + k]   = 1+b;
        Dleft[k*N + k+1] = -a;
        Dright[k*N + k-1] = a;
        Dright[k*N + k]   = 1-b;
        Dright[k*N + k+1] = a;
    }

    // Determine block and thread amounts
    int grid_dim = ((side_width + 31)/32);
    int block_dim = 32;
    dim3 gridDim(grid_dim, grid_dim);
    dim3 blockDim(block_dim, block_dim);

    //copy data from host to device
    magma_dsetmatrix(N, N, U, N, dev_U, N);
    magma_dsetmatrix(N, N, Dleft, N, dev_Dleft, N);
    magma_dsetmatrix(N, N, Dright, N, dev_Dright, N);

    // LU factorize the left hand operator matrix
    magma_dgetrf_gpu(N, N, dev_Dleft, N, piv, &info);

    double tn = 0; //time counter
    // needed to take first step outside while loop because of some tricky transpose nonsense happening
    tn += dt;
    // compute explicit step : Uhat=Dright*U^T
    magmablas_dgemm(MagmaTrans, MagmaNoTrans, N, N, N, 1.0f, dev_Dright, N, dev_U, N, 0.0f, dev_U, N);
    // implicit step solve : Dleft*U=Uhat
    magma_dgetrs_gpu(MagmaTrans, N, N, dev_Dleft, N, piv, dev_U, N, &info);
    // compute explicit step : Uhat=Dright*U^T
    magmablas_dgemm(MagmaTrans, MagmaTrans, N, N, N, 1.0f, dev_Dright, N, dev_U, N, 0.0f, dev_U, N);

    printf("GPU matrix U at time %3.3f \n ", tn);
    magma_dprint_gpu(16, 16, dev_U, N);

    //copy solution from device to host
    magma_dgetmatrix(N, N, dev_U, N, U, N);

    //write data to file
    char filename[256];
    char str_t[128];
    sprintf(str_t, "%d", N);
    sprintf(filename, "ADI_%s.bin", str_t);
    FILE* fileID = fopen(filename, "wb");
    for (int i=0; i<N*N; ++i){
        fwrite(&U[i], sizeof(double), 1, fileID);
    }
    fclose(fileID);

    free(U);
    free(Dleft);
    free(Dright);
    magma_free(dev_U);
    magma_free(dev_Dleft);
    magma_free(dev_Dright);
    free(piv);
    magma_finalize();
    return 0;
}
Accepted answer
As far as I am aware, BLAS/LAPACK gemm has never supported in-place operation. That is,

C := alpha*op( A )*op( B ) + beta*C

cannot be transformed into

A := alpha*op( A )*op( B ) + beta*A

or

B := alpha*op( A )*op( B ) + beta*B

with guaranteed correctness, even for the canonical case of alpha = 1, beta = 0. If you can read Fortran, I recommend taking a look at the reference code from Dongarra's group: that implementation breaks if the matrix pointer passed as C aliases A or B. This is especially true of multithreaded or massively parallel BLAS implementations. Most parallel execution environments do not support any kind of strong or fixed execution ordering, which can mean that operations that happen to work in a serial version of a linear algebra routine break in parallel, precisely because there is no guarantee of the order in which reads and writes occur. Unless a routine in a parallel BLAS or LAPACK implementation explicitly states that it supports in-place operation, assume that it does not; here be dragons...
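To see why aliasing the output onto an input is unsafe even before parallelism enters the picture, here is a minimal, self-contained sketch in plain C; the helper name naive_dgemm is hypothetical and stands in for the reference implementation's loop order. Computing B := A*B "in place" overwrites row 0 of B while later rows of the product still need it:

#include <stdio.h>

/* Naive gemm: C := A*B for n x n row-major matrices. It reads B while
   writing C, so it is only correct when C does not alias A or B. */
static void naive_dgemm(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;   /* when C == B, this destroys B[i*n + j] */
        }
}

int main(void)
{
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4];

    naive_dgemm(2, A, B, C);  /* out-of-place: correct, gives 19 22 43 50 */
    naive_dgemm(2, A, B, B);  /* "in-place" B := A*B: row 0 of B is
                                 clobbered before row 1 of the product
                                 is computed */
    printf("out-of-place: %g %g %g %g\n", C[0], C[1], C[2], C[3]);
    printf("in-place:     %g %g %g %g\n", B[0], B[1], B[2], B[3]);
    return 0;
}

Even this strictly serial version prints 19 22 85 98 for the in-place call instead of 19 22 43 50; in a parallel implementation the corruption is not even deterministic.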
Your MAGMA gemm calls only happen to work at smaller sizes, probably because very small matrices do not expose enough parallelism to hit the correctness problems caused by aliasing the input and output pointers. If you change your code so that the inputs and the output are separate memory allocations, I suspect the problem will disappear.
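A minimal sketch of that change, assuming the same MAGMA 1.x-style API the question uses (no queue argument), and introducing a hypothetical device scratch buffer dev_Uhat; only the time-stepping fragment changes:

    double* dev_Uhat;  /* separate buffer so gemm never writes over its own input */
    err = magma_dmalloc(&dev_Uhat, N*N);
    if (err) { printf("error allocating dev_Uhat\n"); exit(1); }

    // explicit step, output to the separate buffer instead of dev_U
    magmablas_dgemm(MagmaTrans, MagmaNoTrans, N, N, N, 1.0,
                    dev_Dright, N, dev_U, N, 0.0, dev_Uhat, N);
    // implicit step solve, in place in dev_Uhat (getrs overwrites the RHS,
    // which is a documented in-place operation and therefore safe)
    magma_dgetrs_gpu(MagmaTrans, N, N, dev_Dleft, N, piv, dev_Uhat, N, &info);
    // second explicit step, reading dev_Uhat and writing back into dev_U,
    // again with distinct input and output allocations
    magmablas_dgemm(MagmaTrans, MagmaTrans, N, N, N, 1.0,
                    dev_Dright, N, dev_Uhat, N, 0.0, dev_U, N);

    magma_free(dev_Uhat);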
Original question on Stack Overflow: "c - magmablas_dgemm not working for larger grid sizes", https://stackoverflow.com/questions/24174672/