c# - 缓存友好型优化:面向对象的矩阵乘法和函数内平铺矩阵乘法

在使用this implementation编写表示两个1D缓冲区中的整个矩阵的矩阵类之后，我已经达到了项目中的矩阵乘法部分，并且现在倾向于某些缓存友好的优化。偶然发现两个选项（问题在本页下部）：

1）仅在乘法时间内选择块/平铺子矩阵。

在c ++ DLL函数中完成，因此没有函数开销。
由于代码将更加复杂，因此其他优化将难以应用。

2）从子矩阵类（较小的补丁）构建矩阵类，因此通常在子矩阵类上进行乘法。

面向对象的方法为子矩阵的其他优化留出了空间。
C＃的对象标头和填充行为可以帮助克服关键的进步吗？
在多次调用之后，函数开销可能会成为问题。

矩阵乘法示例：C = A.B

 A
 1 2 3 4   is used as  1 2    3 4
 4 3 4 2               4 3    4 2
 1 3 1 2
 1 1 1 2               1 3    1 2
                       1 1    1 2


 B
 1 1 1 1  --->         1 1    1 1
 1 1 1 1               1 1    1 1
 1 1 1 1
 1 1 1 1               1 1    1 1
                       1 1    1 1


 Multiplication:       1 2 * 1 1   +   3 4 * 1 1    ==>  upper-left tile of result
                       4 3   1 1       4 2   1 1


                       same for the upper-right of result

                       1 3 * 1 1   +   1 2 * 1 1    ==> lower left tile of result
                       1 1   1 1       1 2   1 1

                       same for lower-right tile of result

        Multiplication is O(n³) but summation is O(n²).

问题：是否有人尝试过（面向功能和面向对象）并进行了性能比较？现在，我没有任何针对缓存的优化的幼稚乘法需要：

 Matrix Size   Single Threaded Time    Multithreaded Time
 * 128x128   :     5    ms                 1ms-5ms(time sample error is bigger)
 * 256x256   :     25   ms                 7     ms
 * 512x512   :     140  ms                 35    ms
 * 1024x1024 :     1.3  s                  260   ms
 * 2048x2048 :     11.3 s                  2700  ms
 * 4096x4096 :     88.1 s                  24    s
 * 8192x8192 :     710  s                  177   s

 Giga-multiplications of variables per second
               Single threaded         Multithreaded           Multi/single ratio
 * 128x128   :     0.42                    2.0 - 0.4               ?
 * 256x256   :     0.67                    2.39                   3.67x
 * 512x512   :     0.96                    3.84                   4.00x
 * 1024x1024 :     0.83                    3.47                   4.18x
 * 2048x2048 :     0.76                    3.18                   4.18x
 * 4096x4096 :     0.78                    2.86                   3.67x
 * 8192x8192 :     0.77                    3.09                   4.01x

（使用32位浮点数，带有avx优化代码的1.4GHz fx8150的平均结果）（Visual Studio C＃的Parallel.For（）中dll函数中的C ++ avx-intrinsics）

高速缓存未命中，关键步幅和其他不良情况可能会影响上述哪种尺寸的矩阵？您知道如何获得使用内在函数的性能计数器吗？

谢谢您的光临。

编辑：DLL中的内联优化：

 Matrix Size   Single Threaded Time    Multithreaded Time           Multi/Single radio
 * 128x128   :  1     ms(%400)          390us avrage in 10k iterations(6G mult /s)
 * 256x256   :  12    ms(%108 faster)   2     ms   (%250 faster)         6.0x
 * 512x512   :  73    ms(%92 faster)    15    ms   (%133 faster)         4.9x
 * 1024x1024 :  1060  ms(%22 faster)    176   ms   (%48 faster)          6.0x
 * 2048x2048 :  10070 ms(%12 faster)    2350  ms   (%15 faster)          4.3x
 * 4096x4096 :  82.0  s(%7 faster)      22    s    (%9 faster)           3.7x
 * 8192x8192 :  676   s(%5 faster)      174   s    (%2 faster)           4.1x

内联后，较小乘法的阴影性能变得可见。
仍然存在DLL函数C＃开销。 1024x1024的情况似乎是缓存丢失的起点。虽然工作仅增加了7倍，但执行时间却增加了15倍。

编辑：：本周将尝试使用面向对象方法深入研究Strassen的3层算法。主矩阵将由4个子矩阵组成。然后它们将分别由4个子子组成。然后它们将分别由4个子子子组成。这应该使几乎（8/7）（8/7）（8/7）= +％50加速。如果可行，则将DLL函数转换为补丁优化的函数，这将使用更多的缓存。

最佳答案

将Strassen算法仅应用于一层（例如256x256中的四个为512x512）作为一种面向对象的方法（超类为Strassen，子矩阵为matrix类）：

 Matrix Size   Single Threaded Time    Multithreaded Time           Multi/Single radio
 * 128x128   :  **%50 slowdown**        **slowdown**
 * 256x256   :  **%30 slowdown**        **slowdown**
 * 512x512   :  **%10 slowdown**        **slowdown**
 * 1024x1024 :  540   ms(%96 faster)    130   ms   (%35 faster)     4.15
 * 2048x2048 :  7500  ms(%34 faster)    1310  ms   (%79 faster)     5.72
 * 4096x4096 :  70.2  s(%17 faster)     17    s    (%29 faster)     4.13
 * 6144x6144 :          x               68    s
 * 8192x8192 :  outOfMemoryException    outOfMemoryException

DLL函数和C＃之间的开销仍然有效，因此小型矩阵无法更快地运行。但是，当有加速时，它总是大于8/7（％14），因为使用较小的块会更好地利用缓存。

将编写一个基准类，该类反复测试Stressen算法的不同叶子大小与朴素的叶子大小，以找到临界大小。（对于我的系统，它是512x512）。

超类将递归地构建子矩阵树，直到达到512x512的大小，并将对512x512节点使用朴素算法。然后在DLL函数中，修补/阻止算法（将在下周添加）将使其更快一些。但是我不知道如何选择适当的补丁大小，因为我不知道如何获取cpu的缓存行大小。递归Strassen完成后将进行搜索。

我的Strassen算法的实现需要五倍的内存（在上面工作）。

编辑：完成了一些递归操作，并随着resuts的到来更新表。

 Matrix Size   Single Threaded Time    Multithreaded Time
 * 2048x2048 :  x                       872     ms  average(double layer)
 * 2560x2560 :  x                       1527    ms  average(double layer)

并行进行树解析，减少内存占用并引入了完全递归性：

 Matrix Size   Single Threaded Time    Multithreaded Time                         m/s
 * 1024x1024 :  547   ms               123     ms  average(single layer)          4.45x
 * 2048x2048 :  3790  ms               790     ms  average(double layer)          4.79x
 * 4096x4096 :  26.4  s                5440    ms  average(triple layer)          4.85x
 * 8192x8192 :  185   s                38      s   average(quad layer)            4.87x
 * 8192x8192(4GHz):   x                15      s   average(quad layer)            4.87x

每秒乘法（x10 ^ 9）：

 Matrix Size   Single Threaded         Multithreaded
 * 1024x1024 :  1.71                   7.64     (single layer)
 * 2048x2048 :  1.73                   8.31     (double layer)
 * 4096x4096 :  1.74                   8.45     (triple layer)
 * 8192x8192 :  1.74                   8.48     (quad layer)
 * 8192x8192(4GHz):   x                21.49    (quad layer)
 Strassen's cpu flops is multiplied by 7/8 for each layer.

刚刚发现，使用价格类似的GPU可以使用opencl在1秒内完成8kx8k。

关于c# - 缓存友好型优化:面向对象的矩阵乘法和函数内平铺矩阵乘法，我们在Stack Overflow上找到一个类似的问题：https://stackoverflow.com/questions/17785736/