问题描述
取两个3×3的矩阵 A * B = C
的产物。天真这需要使用标准算法 27乘法。如果一个人是聪明的,你可以只用23次乘法做到这一点,a结果由Laderman 发现于1973年。该技术涉及节能的中间步骤,并将它们以正确的方式相结合。
Take the product of two 3x3 matrices A*B=C
. Naively this requires 27 multiplications using the standard algorithm. If one were clever, you could do this using only 23 multiplications, a result found in 1973 by Laderman. The technique involves saving intermediate steps and combining them in the right way.
现在可以修复的语言和类型,比如C ++与元素双
。如果Laderman算法是硬codeD相对于简单的双回路,可以预计一个现代编译器的性能,以排挤的算法的区别在哪里?
Now lets fix a language and a type, say C++ with elements of double
. If the Laderman algorithm was hard-coded versus the simple double loop, could we expect the performance of a modern compiler to edge out the differences of the algorithms?
注意这个问题:这是一个的节目的网站,并在此提出问题在一段时间关键型内循环的最佳实践的环境; premature的优化,这是不是。实施小费极大欢迎注释。
Notes about this question: This is a programming site, and the question is asked in the context of the best practice for a time-critical inner loop; premature optimization this is not. Tips on implementation are greatly welcomed as comments.
推荐答案
我跑的时间测试自己,结果出乎我的意料(所以为什么我要摆在首位的问题)。这样做的缺点是,在一个标准的编译 laderman
是〜225%的速度,但与 -03
优化标志是比较慢的 50%的!我有 -O3
标记或编译器的在每次添加一个随机元素到矩阵完全优化掉的简单乘法,服用时间零内部时钟precision。因为 laderman
算法是一个痛苦的检查/仔细检查我会后下面为后人完整的code。
Timing Tests:
I ran the timing tests myself and the results surprised me (hence why I asked the question in the first place). The short of it is, under a standard compile the laderman
is ~ 225% faster, but with the -03
optimizing flag it is 50% slower! I had to add a random element into the matrix each time during the -O3
flag or the compiler completely optimized away the simple multiplication, taking a time of zero within clock precision. Since the laderman
algorithm was a pain to check/double check I'll post the complete code below for posterity.
规格:Ubuntu的12.04,戴尔prevision T1600,海湾合作委员会。在计时百分比差异:
Specs: Ubuntu 12.04, Dell Prevision T1600, gcc. Percent difference in timings:
-
G ++ [2.22,2.23,2.27]
-
G ++ -O3 [-0.48,-0.49,-0.48]
-
G ++ -funroll-循环-O3 [-0.48,-0.48,-0.47]
g++ [2.22, 2.23, 2.27]
g++ -O3 [-0.48, -0.49, -0.48]
g++ -funroll-loops -O3 [-0.48, -0.48, -0.47]
标杆code以及Laderman实现:
Benchmarking code along with Laderman implementation:
#include <iostream>
#include <ctime>
#include <cstdio>
#include <cstdlib>
using namespace std;
void simple_mul(const double a[3][3],
const double b[3][3],
double c[3][3]) {
int i,j,m,n;
for(i=0;i<3;i++) {
for(j=0;j<3;j++) {
c[i][j] = 0;
for(m=0;m<3;m++)
c[i][j] += a[i][m]*b[m][j];
}
}
}
void laderman_mul(const double a[3][3],
const double b[3][3],
double c[3][3]) {
double m[24]; // not off by one, just wanted to match the index from the paper
m[1 ]= (a[0][0]+a[0][1]+a[0][2]-a[1][0]-a[1][1]-a[2][1]-a[2][2])*b[1][1];
m[2 ]= (a[0][0]-a[1][0])*(-b[0][1]+b[1][1]);
m[3 ]= a[1][1]*(-b[0][0]+b[0][1]+b[1][0]-b[1][1]-b[1][2]-b[2][0]+b[2][2]);
m[4 ]= (-a[0][0]+a[1][0]+a[1][1])*(b[0][0]-b[0][1]+b[1][1]);
m[5 ]= (a[1][0]+a[1][1])*(-b[0][0]+b[0][1]);
m[6 ]= a[0][0]*b[0][0];
m[7 ]= (-a[0][0]+a[2][0]+a[2][1])*(b[0][0]-b[0][2]+b[1][2]);
m[8 ]= (-a[0][0]+a[2][0])*(b[0][2]-b[1][2]);
m[9 ]= (a[2][0]+a[2][1])*(-b[0][0]+b[0][2]);
m[10]= (a[0][0]+a[0][1]+a[0][2]-a[1][1]-a[1][2]-a[2][0]-a[2][1])*b[1][2];
m[11]= a[2][1]*(-b[0][0]+b[0][2]+b[1][0]-b[1][1]-b[1][2]-b[2][0]+b[2][1]);
m[12]= (-a[0][2]+a[2][1]+a[2][2])*(b[1][1]+b[2][0]-b[2][1]);
m[13]= (a[0][2]-a[2][2])*(b[1][1]-b[2][1]);
m[14]= a[0][2]*b[2][0];
m[15]= (a[2][1]+a[2][2])*(-b[2][0]+b[2][1]);
m[16]= (-a[0][2]+a[1][1]+a[1][2])*(b[1][2]+b[2][0]-b[2][2]);
m[17]= (a[0][2]-a[1][2])*(b[1][2]-b[2][2]);
m[18]= (a[1][1]+a[1][2])*(-b[2][0]+b[2][2]);
m[19]= a[0][1]*b[1][0];
m[20]= a[1][2]*b[2][1];
m[21]= a[1][0]*b[0][2];
m[22]= a[2][0]*b[0][1];
m[23]= a[2][2]*b[2][2];
c[0][0] = m[6]+m[14]+m[19];
c[0][1] = m[1]+m[4]+m[5]+m[6]+m[12]+m[14]+m[15];
c[0][2] = m[6]+m[7]+m[9]+m[10]+m[14]+m[16]+m[18];
c[1][0] = m[2]+m[3]+m[4]+m[6]+m[14]+m[16]+m[17];
c[1][1] = m[2]+m[4]+m[5]+m[6]+m[20];
c[1][2] = m[14]+m[16]+m[17]+m[18]+m[21];
c[2][0] = m[6]+m[7]+m[8]+m[11]+m[12]+m[13]+m[14];
c[2][1] = m[12]+m[13]+m[14]+m[15]+m[22];
c[2][2] = m[6]+m[7]+m[8]+m[9]+m[23];
}
int main() {
int N = 1000000000;
double A[3][3], C[3][3];
std::clock_t t0,t1;
timespec tm0, tm1;
A[0][0] = 3/5.; A[0][1] = 1/5.; A[0][2] = 2/5.;
A[1][0] = 3/7.; A[1][1] = 1/7.; A[1][2] = 3/7.;
A[2][0] = 1/3.; A[2][1] = 1/3.; A[2][2] = 1/3.;
t0 = std::clock();
for(int i=0;i<N;i++) {
// A[0][0] = double(rand())/RAND_MAX; // Keep this in for -O3
simple_mul(A,A,C);
}
t1 = std::clock();
double tdiff_simple = (t1-t0)/1000.;
cout << C[0][0] << ' ' << C[0][1] << ' ' << C[0][2] << endl;
cout << C[1][0] << ' ' << C[1][1] << ' ' << C[1][2] << endl;
cout << C[2][0] << ' ' << C[2][1] << ' ' << C[2][2] << endl;
cout << tdiff_simple << endl;
cout << endl;
t0 = std::clock();
for(int i=0;i<N;i++) {
// A[0][0] = double(rand())/RAND_MAX; // Keep this in for -O3
laderman_mul(A,A,C);
}
t1 = std::clock();
double tdiff_laderman = (t1-t0)/1000.;
cout << C[0][0] << ' ' << C[0][1] << ' ' << C[0][2] << endl;
cout << C[1][0] << ' ' << C[1][1] << ' ' << C[1][2] << endl;
cout << C[2][0] << ' ' << C[2][1] << ' ' << C[2][2] << endl;
cout << tdiff_laderman << endl;
cout << endl;
double speedup = (tdiff_simple-tdiff_laderman)/tdiff_laderman;
cout << "Approximate speedup: " << speedup << endl;
return 0;
}
这篇关于Laderman的3×3的矩阵乘法,只有23乘法,是否值得呢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!