Problem Description
I am working on parallel programming concepts and trying to optimize a matrix multiplication example on a single core. The fastest implementation I have come up with so far is the following:
/* This routine performs a dgemm operation
 *   C := C + A * B
 * where A, B, and C are n-by-n matrices stored in column-major format.
 * On exit, A and B maintain their input values. */
void square_dgemm (int n, double* A, double* B, double* C)
{
  /* For each row i of A */
  for (int i = 0; i < n; ++i)
    /* For each column j of B */
    for (int j = 0; j < n; ++j)
    {
      /* Compute C(i,j) */
      double cij = C[i+j*n];
      for (int k = 0; k < n; k++)
        cij += A[i+k*n] * B[k+j*n];
      C[i+j*n] = cij;
    }
}
The results are shown below. How can I restructure the loops to increase performance?
login4.stampede(72)$ tail -f job-naive.stdout
Size: 480 Mflop/s: 1818.89 Percentage: 18.95
Size: 511 Mflop/s: 2291.73 Percentage: 23.87
Size: 512 Mflop/s: 937.061 Percentage: 9.76
Size: 639 Mflop/s: 293.434 Percentage: 3.06
Size: 640 Mflop/s: 270.238 Percentage: 2.81
Size: 767 Mflop/s: 240.209 Percentage: 2.50
Size: 768 Mflop/s: 242.118 Percentage: 2.52
Size: 769 Mflop/s: 240.173 Percentage: 2.50
Average percentage of Peak = 22.0802
Grade = 33.1204
Recommended Answer
The state-of-the-art implementation of matrix multiplication on CPUs uses the GotoBLAS algorithm. Basically, the loops are organized in the following order (a C sketch of this structure follows the listing):
Loop5 for jc = 0 to N-1 in steps of NC
  Loop4 for kc = 0 to K-1 in steps of KC
    // Pack KC x NC block of B
    Loop3 for ic = 0 to M-1 in steps of MC
      // Pack MC x KC block of A
      // -------------------- Macro Kernel ------------
      Loop2 for jr = 0 to NC-1 in steps of NR
        Loop1 for ir = 0 to MC-1 in steps of MR
          // -------------------- Micro Kernel ------------
          Loop0 for k = 0 to KC-1 in steps of 1
            // update MR x NR block of C matrix
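To make the structure concrete, here is a minimal C sketch of these loops applied to the column-major square_dgemm above. The block sizes MC/KC/NC, the function name square_dgemm_blocked, and the plain triple loop standing in for the packed macro/micro-kernels are illustrative assumptions, not tuned or production values:

#define MC 96    /* assumed, untuned block sizes; real values are chosen per CPU cache */
#define KC 256
#define NC 1024

static int min_int (int a, int b) { return a < b ? a : b; }

/* Blocked C := C + A * B for n-by-n column-major matrices. */
void square_dgemm_blocked (int n, double* A, double* B, double* C)
{
  for (int jc = 0; jc < n; jc += NC)        /* Loop5: columns of C */
    for (int kc = 0; kc < n; kc += KC)      /* Loop4: the k dimension */
    {
      int nc = min_int (NC, n - jc);
      int kb = min_int (KC, n - kc);
      for (int ic = 0; ic < n; ic += MC)    /* Loop3: rows of C */
      {
        int mc = min_int (MC, n - ic);
        /* Macro-kernel: a real implementation would pack the current
         * blocks of A and B into contiguous buffers here and call a
         * register-blocked micro-kernel instead of this triple loop. */
        for (int j = 0; j < nc; ++j)
          for (int i = 0; i < mc; ++i)
          {
            double cij = C[(ic+i) + (jc+j)*n];
            for (int k = 0; k < kb; ++k)    /* Loop0 stand-in */
              cij += A[(ic+i) + (kc+k)*n] * B[(kc+k) + (jc+j)*n];
            C[(ic+i) + (jc+j)*n] = cij;
          }
      }
    }
}

Even without packing, this ordering keeps each block of A, B, and C hot in cache while it is reused, which is what avoids the sharp performance drop at the power-of-two sizes seen in the results above.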
A key insight underlying modern high-performance implementations of matrix multiplication is to organize the computation by partitioning the operands into blocks for temporal locality (the 3 outermost loops), and to pack (copy) such blocks into contiguous buffers that fit into various levels of memory for spatial locality (the 3 innermost loops).
The figure in this tutorial (https://github.com/flame/blislab/blob/master/tutorial.pdf, originally from this paper) illustrates the GotoBLAS algorithm as implemented in BLIS. Cache blocking parameters {MC, NC, KC} determine the submatrix sizes of Bp (KC × NC) and Ai (MC × KC), such that they fit in various caches. During the computation, row panels of B are contiguously packed into the buffer Bp to fit in the L3 cache. Blocks of A are similarly packed into the buffer Ai to fit in the L2 cache. Register block sizes {MR, NR} relate to submatrices in registers that contribute to C. In the micro-kernel (the innermost loop), a small MR × NR micro-tile of C is updated by a pair of MR × KC and KC × NR slivers of Ai and Bp.
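For the micro-kernel itself, below is a hedged AVX2/FMA sketch of an MR × NR = 4 × 4 update in the spirit of the description above. It assumes the Ai sliver is packed as kc groups of 4 consecutive rows and the Bp sliver as kc groups of 4 consecutive row entries; the function name and packing layout are illustrative assumptions, not the BLIS interface (compile with -mavx2 -mfma):

#include <immintrin.h>

/* Update a 4x4 micro-tile of column-major C (leading dimension ldc):
 * C(0:3, 0:3) += Ap * Bp, where Ap is packed as kc columns of 4 doubles
 * and Bp as kc rows of 4 doubles. */
void micro_kernel_4x4 (int kc, const double* Ap, const double* Bp,
                       double* C, int ldc)
{
  __m256d c0 = _mm256_loadu_pd (&C[0*ldc]);   /* column 0 of the tile */
  __m256d c1 = _mm256_loadu_pd (&C[1*ldc]);
  __m256d c2 = _mm256_loadu_pd (&C[2*ldc]);
  __m256d c3 = _mm256_loadu_pd (&C[3*ldc]);
  for (int k = 0; k < kc; ++k)
  {
    __m256d a = _mm256_loadu_pd (&Ap[4*k]);   /* 4 rows of A at step k */
    c0 = _mm256_fmadd_pd (a, _mm256_broadcast_sd (&Bp[4*k + 0]), c0);
    c1 = _mm256_fmadd_pd (a, _mm256_broadcast_sd (&Bp[4*k + 1]), c1);
    c2 = _mm256_fmadd_pd (a, _mm256_broadcast_sd (&Bp[4*k + 2]), c2);
    c3 = _mm256_fmadd_pd (a, _mm256_broadcast_sd (&Bp[4*k + 3]), c3);
  }
  _mm256_storeu_pd (&C[0*ldc], c0);
  _mm256_storeu_pd (&C[1*ldc], c1);
  _mm256_storeu_pd (&C[2*ldc], c2);
  _mm256_storeu_pd (&C[3*ldc], c3);
}

Keeping the 4 × 4 tile in four vector registers across the whole k loop amortizes the loads and stores of C, so the fused multiply-adds can run near peak throughput.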
For Strassen's algorithm, which has O(N^2.807) complexity, you might be interested in reading this paper. Other fast matrix multiplication algorithms with asymptotic complexity less than O(N^3) can be found in this paper. There is also a recent thesis about practical fast matrix multiplication algorithms.
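The exponent follows from Strassen's recursion, which computes the product of two half-size block matrices with 7 recursive multiplications instead of 8; by the master theorem:

T(n) = 7\,T(n/2) + \Theta(n^2) \quad\Longrightarrow\quad T(n) = \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.807}\right)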
The following tutorials might be helpful if you want to learn more about how to optimize matrix multiplication on CPUs:
An up-to-date document about how to optimize GEMM on CPUs (with AVX2/FMA) step by step can be downloaded here: https://github.com/ULAFF/LAFF-On-HPC/blob/master/LAFF-On-PfHP.pdf
A Massive Open Online Course to be offered on edX starting in June 2019 (LAFF-On Programming for High Performance): https://github.com/ULAFF/LAFF-On-HPC and http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html