Question
When I increase/decrease the batch size of the mini-batch used in SGD, should I change the learning rate? If so, how?
For reference, I was discussing this with someone, and they said that when the batch size is increased, the learning rate should be decreased by some extent.
My understanding is that when I increase the batch size, the computed average gradient will be less noisy, so I would either keep the same learning rate or increase it.
Also, if I use an adaptive learning rate optimizer, like Adam or RMSProp, then I guess I can leave the learning rate untouched.
Please correct me if I am mistaken, and give any insight on this.
Answer
Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance of the gradient expectation constant. See page 5 of A. Krizhevsky, One weird trick for parallelizing convolutional neural networks: https://arxiv.org/abs/1404.5997
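As a concrete illustration of the square-root rule, here is a minimal Python sketch; the function name and the example numbers are hypothetical, not taken from the paper or any library.

```python
import math

def sqrt_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Illustrative sketch: multiply the learning rate by sqrt(k)
    when the batch size grows by a factor k (function name is hypothetical)."""
    k = new_batch_size / base_batch_size
    return base_lr * math.sqrt(k)

# Example: quadrupling the batch size (k = 4) doubles the learning rate.
print(sqrt_scaled_lr(0.1, 256, 1024))  # 0.2
```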
However, recent experiments with large mini-batches suggest a simpler linear scaling rule, i.e., multiply your learning rate by k when using a mini-batch size of kN. See P. Goyal et al., Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour: https://arxiv.org/abs/1706.02677
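A sketch of the linear scaling rule under the same assumptions (hypothetical function name, illustrative base setting of lr = 0.1 at batch size 256):

```python
def linear_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Illustrative sketch: multiply the learning rate by k
    when the mini-batch size is multiplied by k (function name is hypothetical)."""
    k = new_batch_size / base_batch_size
    return base_lr * k

# Example: scaling from batch size 256 up to a mini-batch of 8192 (k = 32)
# multiplies the learning rate by 32.
print(linear_scaled_lr(0.1, 256, 8192))  # 3.2
```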
I would say that with Adam, Adagrad, and other adaptive optimizers, the learning rate may remain the same if the batch size does not change substantially.