问题描述
我在C ++中写了两个矩阵乘法程序:Regular MM 和Strassen的MM ,它们都对大小为2 ^ kx 2 ^ k的方阵进行操作(换句话说, )。
I wrote two Matrix Multiplications programs in C++: Regular MM (source), and Strassen's MM (source), both of which operate on square matrices of sizes 2^k x 2^k(in other words, square matrices of even size).
结果太糟糕了。对于1024 x 1024矩阵,常规MM需要 46.381秒
,而Strassen的MM需要 1484.303秒
> 25分钟 !!!!)。
Results are just terrible. For 1024 x 1024 matrix, Regular MM takes 46.381 sec
, while Strassen's MM takes 1484.303 sec
(25 minutes
!!!!).
我试图保持代码尽可能简单。其他Strassen在网络上找到的MM示例与我的代码没有太大的不同。 Strassen的代码的一个问题是显而易见的 - 我没有截止点,切换到常规MM。
I attempted to keep the code as simple as possible. Other Strassen's MM examples found on the web are not that much different from my code. One issue with Strassen's code is obvious - I don't have cutoff point, that switches to regular MM.
我的Strassen的MM代码有什么其他问题?
What other issues my Strassen's MM code has ???
感谢!
直接链接到源
Edit1。
拳头,很多很好的建议。谢谢你花时间和分享知识。
Edit1.Fist, a lot of great advices. Thank you for taking your time and sharing knowledge.
我实现了更改(保留了我的所有代码),添加了截止点。
MM为2048x2048矩阵,截止512已经给出了良好的结果。
常规MM:191.49s
Strassen的MM:112.179s
显着改进。
结果是使用Visual Studio 2012在具有Intel Centrino处理器的史前联想X61 TabletPC上获得的结果。
我将进行更多检查(以确保获得正确的结果),并将发布结果。 p>
I implemented changes(kept all of my code), added cut-off point.MM of 2048x2048 matrix, with cutoff 512 already gives good results.Regular MM: 191.49sStrassen's MM: 112.179sSignificant improvement.Results were obtained on prehistoric Lenovo X61 TabletPC with Intel Centrino processor, using Visual Studio 2012.I will do more checks(to make sure I got the correct results), and will publish the results.
推荐答案
这是公平的说,递归到1点是大部分(如果不是整个)问题。试图猜测其他性能瓶颈而不解决这个问题几乎是不可能的,因为它带来了巨大的性能影响。 (换句话说,你正在比较苹果和橘子。)
It's fair to say that recursing down to 1 point is the bulk of (if not the entire) problem. Trying to guess at other performance bottlenecks without addressing this is almost moot due to the massive performance hit that it brings. (In other words, you're comparing Apples to Oranges.)
正如在注释中讨论的,缓存对齐可能有效果,但不是这个尺度。由于后者是缓存不存在的,所以缓存对齐可能比Strassen算法更容易损害正则算法。
As discussed in the comments, cache alignment could have an effect, but not to this scale. Furthemore, cache alignment would likely hurt the regular algorithm more than the Strassen algorithm since the latter is cache-oblivious.
void strassen(int **a, int **b, int **c, int tam) {
// trivial case: when the matrix is 1 X 1:
if (tam == 1) {
c[0][0] = a[0][0] * b[0][0];
return;
}
这太小了。虽然Strassen算法具有较小的复杂性,但它具有更大的Big-O常数。一个,你有函数调用开销一直到1元素。
That's far too small. While the Strassen algorithm has a smaller complexity, it has a much bigger Big-O constant. For one, you have function call overhead all the way down to 1 element.
这类似于使用合并或快速排序和递归到一个元素。为了提高效率,您需要在大小变小并回到经典算法时停止递归。
This is analogous to using merge or quick sort and recursing all the way down to one element. To be efficient you need to stop the recursion when the size gets small and fall back to the classic algorithm.
在快速/合并排序中,您会回到低开销 O(n ^ 2)
插入或选择排序。在这里,您将回到正常的 O(n ^ 3)
矩阵乘法。
In quick/merge sort, you'd fall back to a low-overhead O(n^2)
insertion or selection sort. Here you would fall back to the normal O(n^3)
matrix multiply.
经典算法的阈值应该是一个可调阈值,可能会因硬件和编译器优化代码的能力而有所不同。
The threshold which you fall back the classic algorithm should be a tunable threshold that will likely vary depending on the hardware and the ability of the compiler to optimize the code.
对于像Strassen乘法那样的优势只有 O(2.8074)
超过经典的 O(n ^ 3)
,不要惊讶,如果这个阈值变得非常高。 (数千个元素?)
For something like Strassen multiplication where the advantage is only O(2.8074)
over the classic O(n^3)
, don't be surprised if this threshold turns out to be very high. (thousands of elements?)
在某些应用程序中,可能有许多算法,每个都有递减的复杂性,但会增加Big-O。
In some applications there can be many algorithms each with decreasing complexity but increasing Big-O. The result is that multiple algorithms become optimal at different sizes.
大整数乘法是一个臭名昭着的例子:
Large integer multiplication is a notorious example of this:
- 成绩 - 乘法: O(N ^ 2)〜100 digit *
- : O(N ^ 1.585 )的速度超过100位数字*
- : O(N ^ 1.465)比Karatsuba快约3000位数*
- : O(> N log(N))比Karatsuba / Toom- 3 at〜700 digits *
- : : O(N log(n)比SSA快几十亿个数字?
- Grade-school Multiplication: O(N^2) optimal for < ~100 digits*
- Karatsuba Multiplication: O(N^1.585) faster than above at ~100 digits*
- Toom-Cook 3-way: O(N^1.465) faster than Karatsuba at ~3000 digits*
- Floating-point FFT: O(> N log(N)) faster than Karatsuba/Toom-3 at ~700 digits*
- Schönhage–Strassen algorithm (SSA): O(N log(n) loglog(n)) faster than FFT at ~ a billion digits*
- Fixed-width Number-Theoretic Transform: O(N log(n) faster than SSA at ~ a few billion digits?*
这篇关于为什么我的Strassen矩阵乘法慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!