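I have a matrix X whose n columns are data vectors in d-dimensional space.
Given a vector xj, v[j] is its l1 norm (the sum of all abs(xji)), w[j] is the square of its l2 norm (the sum of all xji^2), and pj[i] is a combination of the entries divided by the l1 norm and the squared l2 norm. In the end I need to output pj, v and w for subsequent applications.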
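Written out as formulas, the three quantities look like this. This is only a restatement of the description above, assuming x_{ji} denotes the i-th entry of column j and \alpha is the mixing weight `alpha` from the code:

```latex
\[
v_j = \sum_{i=1}^{d} |x_{ji}|, \qquad
w_j = \sum_{i=1}^{d} x_{ji}^{2}, \qquad
p_j[i] = \alpha \, \frac{|x_{ji}|}{v_j} + (1-\alpha) \, \frac{x_{ji}^{2}}{w_j}
\]
```

With these definitions the entries of p_j sum to 1 whenever v_j and w_j are nonzero.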
// X = new double[d*n]; is the input (column j occupies X[j*d] .. X[j*d+d-1]).
// Needs #include <cmath> so that std::abs picks the double overload.
double alpha = 0.5;
double *pj = new double[d];
double *x_abs = new double[d];
double *x_2 = new double[d];
double *v = new double[n]();
double *w = new double[n]();
for (unsigned long j=0; j<n; ++j) {
    unsigned long jd = j*d;  // offset of column j in X
    for (unsigned long i=0; i<d; ++i) {
        x_abs[i] = std::abs(X[i+jd]);
        v[j] += x_abs[i];
        x_2[i] = x_abs[i]*x_abs[i];
        w[j] += x_2[i];
    }
    for (unsigned long i=0; i<d; ++i) {
        pj[i] = alpha*x_abs[i]/v[j] + (1-alpha)*x_2[i]/w[j];
    }
    // functionA(pj){ ... ...} for subsequent applications
}
// functionB(v, w){ ... ...} for subsequent applications
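The algorithm above has O(nd) flops/time complexity. Can anyone help me speed it up, either with built-in functions or with a new implementation in C++? Reducing the constant factor in O(nd) would also help me a lot.
Best answer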
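Let me guess: since you have performance problems, the dimension of your vectors is quite large. If that is the case, it is worth thinking about CPU cache locality; there is some interesting material on this topic.
If the data is not already in the CPU cache, the cost of taking its absolute value or squaring it once it arrives is dwarfed by the time the CPU spends waiting for the data.
With that in mind, you may want to try the following solution (no guarantee that it improves performance; the compiler may already apply these techniques when it optimizes the code):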
for (unsigned long j=0; j<n; ++j) {
    // use pointer arithmetic - at > -O0 the compiler will do it anyway
    double *start=X+j*d, *end=X+(j+1)*d;
    // this part avoids, as much as possible, the competition
    // on CPU caches between X and v/w.
    // Don't store the norms in v/w yet, keep them in registers
    double l1norm=0, l2norm=0;
    for (double *src=start; src!=end; src++) {
        double val=*src;
        l1norm += std::abs(val);  // use the value, not the pointer
        l2norm += val*val;        // l2norm holds the squared L2 norm
    }
    // divide by both norms, as in pj[i] = alpha*|x|/v[j] + (1-alpha)*x^2/w[j]
    double pl1=alpha/l1norm, pl2=(1-alpha)/l2norm;
    for (double *src=start, *dst=pj; src!=end; src++, dst++) {
        // Yes, recomputing abs/sqr may actually save time by not
        // creating competition on CPU caches with x_abs and x_2
        double val=*src;
        *dst = pl1*std::abs(val) + pl2*val*val;
    }
    // functionA(pj){ ... ...} for subsequent applications
    // Think carefully about whether you really need v/w. If you do,
    // at least there are only two values to be sent to memory;
    // meanwhile the CPU can already load the next vector into cache
    v[j]=l1norm; w[j]=l2norm;
}
// functionB(v, w){ ... ...} for subsequent applications
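
For anyone who wants to try the idea in isolation, here is a minimal sketch of the per-column step as a standalone function. The names `ColumnNorms` and `mix_column` are made up for this illustration and are not part of the question's code. The handling of an all-zero column (weights set to 0 instead of dividing by zero) is also an assumption of mine, since the question does not say what should happen in that case.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical result type: the two norms of one column.
struct ColumnNorms {
    double l1;    // sum of |x_i|
    double l2sq;  // sum of x_i^2 (squared L2 norm)
};

// Hypothetical helper: two passes over one column of length d.
// Pass 1 accumulates both norms in local variables (registers),
// pass 2 recomputes |x| and x^2 instead of reading scratch arrays.
ColumnNorms mix_column(const double* col, std::size_t d,
                       double alpha, double* pj) {
    assert(alpha >= 0.0 && alpha <= 1.0);

    double l1 = 0.0, l2sq = 0.0;
    for (std::size_t i = 0; i < d; ++i) {
        const double val = col[i];
        l1 += std::abs(val);
        l2sq += val * val;
    }

    // Assumption: an all-zero column gets all-zero weights rather than NaN.
    const double pl1 = l1 > 0.0 ? alpha / l1 : 0.0;
    const double pl2 = l2sq > 0.0 ? (1.0 - alpha) / l2sq : 0.0;

    for (std::size_t i = 0; i < d; ++i) {
        const double val = col[i];
        pj[i] = pl1 * std::abs(val) + pl2 * val * val;
    }
    return {l1, l2sq};
}
```

In the outer loop this would be called as `ColumnNorms r = mix_column(X + j*d, d, alpha, pj);` followed by `v[j] = r.l1; w[j] = r.l2sq;`. Whether it actually beats the original version depends on d, the compiler flags and the hardware, so measure before committing to it.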