Problem description
After profiling my backpropagation algorithm, I have learnt that it is responsible for 60% of my computation time. Before I start looking at parallel alternatives, I would like to see if there is anything further I can do.
The activate(const double input[]) function is profiled to take only ~5% of the time. The gradient(const double input) function is implemented as follows:
inline double gradient(const double input) { return (1 - (input * input)); }
The train function in question:
void train(const vector<double>& data, const vector<double>& desired, const double learn_rate, const double momentum) {
    this->activate(data);
    this->calculate_error(desired);
    // adjust weights for layers
    const auto n_layers = this->config.size();
    const auto adjustment = (1 - momentum) * learn_rate;
    for (size_t i = 1; i < n_layers; ++i) {
        const auto& inputs = i - 1 > 0 ? this->outputs[i - 1] : data;
        const auto n_inputs = this->config[i - 1];
        const auto n_neurons = this->config[i];
        for (size_t j = 0; j < n_neurons; ++j) {
            const auto adjusted_error = adjustment * this->errors[i][j];
            for (size_t k = 0; k < n_inputs; ++k) {
                const auto delta = adjusted_error * inputs[k] + (momentum * this->deltas[i][j][k]);
                this->deltas[i][j][k] = delta;
                this->weights[i][j][k] += delta;
            }
            // bias weight is stored at index n_inputs
            const auto delta = adjusted_error * this->bias + (momentum * this->deltas[i][j][n_inputs]);
            this->deltas[i][j][n_inputs] = delta;
            this->weights[i][j][n_inputs] += delta;
        }
    }
}
This question would probably be a better fit for https://codereview.stackexchange.com/.
Recommended answer
You can't avoid an O(n^2) algorithm if you want to train/use a NN. But it is perfectly suited for vector arithmetic. For example, with clever use of SSE or AVX you could process the neurons in chunks of 4 or 8 and use a multiply-add instead of two separate instructions.
If you use a modern compiler, carefully reformulate the algorithm, and use the right switches, you might even get the compiler to autovectorize the loops for you, but your mileage may vary.
For gcc, autovectorization is activated using -O3 or -ftree-vectorize. You need a vector-capable CPU of course, with something like -march=core2 -msse4.1 or similar, depending on the target CPU. If you use -ftree-vectorizer-verbose=2 you get detailed explanations of why and where loops were not vectorized. Have a look at http://gcc.gnu.org/projects/tree-ssa/vectorization.html.
Better, of course, is using the compiler intrinsics directly.