This article looks at how to choose the mini-batch size for neural network regression; the question and recommended answer below should be a useful reference for anyone facing the same problem.

Problem description

I am doing neural network regression with 4 features. How do I determine the mini-batch size for my problem? I see people use batch sizes of 100 ~ 1000 for computer vision, where each image has 32*32*3 features; does that mean I should use a batch size of 1 million? I have billions of data points and tens of GB of memory, so there is no hard constraint preventing me from doing that.

I also observed that using a mini-batch of size ~1000 makes convergence much faster than a batch size of 1 million. I thought it should be the other way around, since the gradient computed with a larger batch is more representative of the gradient over the whole sample. Why does using a smaller mini-batch make convergence faster?
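
For concreteness, the setup in the question boils down to a plain mini-batch SGD loop in which batch_size is the quantity being chosen. Below is a minimal, hypothetical NumPy sketch on made-up 4-feature toy data (the names minibatch_sgd, X, y and so on are illustrative, not from the question); it only shows where the batch size enters, so that different values, e.g. 1,000 versus the full data set, can be compared.

    import numpy as np

    def minibatch_sgd(X, y, batch_size=1000, lr=0.01, epochs=5, seed=0):
        # Plain mini-batch SGD on a linear model y ~ X @ w with squared loss.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            order = rng.permutation(n)
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                Xb, yb = X[idx], y[idx]
                grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)  # mini-batch gradient
                w -= lr * grad
        return w

    # Toy regression data with 4 features, as in the question.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 4))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100_000)
    w_small = minibatch_sgd(X, y, batch_size=1_000)    # many noisy updates per epoch
    w_large = minibatch_sgd(X, y, batch_size=100_000)  # one full-batch update per epoch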

Recommended answer

From Tradeoff batch size vs. number of iterations to train a neural network:

From Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. https://arxiv.org/abs/1609.04836 :

[…]

The lack of generalization ability is due to the fact that large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by large positive eigenvalues in $\nabla^2 f(x)$ and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers characterized by small positive eigenvalues of $\nabla^2 f(x)$. We have observed that the loss function landscape of deep neural networks is such that large-batch methods are almost invariably attracted to regions with sharp minima and that, unlike small batch methods, are unable to escape basins of these minimizers.

[…]
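
The sharp-versus-flat contrast in the quote is about the curvature of the training loss around the minimizer. As a toy, one-dimensional illustration (hypothetical, not taken from the paper), the sketch below compares two minima with identical training loss at x = 0 but very different curvature, i.e. very different Hessian eigenvalues, and shows that the same small parameter perturbation costs far more loss at the sharp minimum.

    def sharp(x):
        # Sharp minimum at x = 0: large curvature.
        return 100.0 * x ** 2

    def flat(x):
        # Flat minimum at x = 0: small curvature, same minimal loss.
        return 0.1 * x ** 2

    # Central finite difference for the second derivative at the minimum;
    # in 1-D this is the single Hessian "eigenvalue" the quote refers to.
    eps = 1e-3
    curv_sharp = (sharp(eps) - 2 * sharp(0.0) + sharp(-eps)) / eps ** 2  # ~200
    curv_flat = (flat(eps) - 2 * flat(0.0) + flat(-eps)) / eps ** 2      # ~0.2

    # The same small shift away from the minimizer (e.g. a train/test
    # mismatch) raises the loss 1000x more at the sharp minimum.
    shift = 0.1
    print(curv_sharp, curv_flat)       # large vs. small eigenvalue
    print(sharp(shift), flat(shift))   # 1.0 vs. 0.001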

Also, some good insights from Ian Goodfellow answering why not use the whole training set to compute the gradient? on Quora (a small numerical sketch of both of his points follows the quote):

When you put m examples in a minibatch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(sqrt(m)). In other words, there are diminishing marginal returns to putting more examples in the minibatch. You can read more about this in Chapter 8 of the deep learning textbook, on optimization algorithms for deep learning: http://www.deeplearningbook.org/contents/optimization.html

Also, if you think about it, even using the entire training set doesn’t really give you the true gradient. The true gradient would be the expected gradient with the expectation taken over all possible examples, weighted by the data generating distribution. Using the entire training set is just using a very large minibatch size, where the size of your minibatch is limited by the amount you spend on data collection, rather than the amount you spend on computation.
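
Both points in Goodfellow's quote can be checked with a small numerical experiment: the spread of the mini-batch gradient shrinks only like 1/sqrt(m) while its cost grows like m, and even the gradient over the whole training set is still just an estimate of the true (population) gradient. The NumPy sketch below is a hypothetical illustration on synthetic linear-regression data chosen so that the population gradient has a closed form; nothing in it comes from the quoted answer.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1_000_000, 4
    w_true = np.array([1.0, -2.0, 0.5, 3.0])
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(size=n)
    w = np.zeros(d)  # fixed point at which all gradients are evaluated

    def batch_grad(idx):
        # Mean squared-error gradient over the examples selected by idx.
        Xb, yb = X[idx], y[idx]
        return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

    # 1) Diminishing returns: the spread of the mini-batch gradient estimate
    #    shrinks roughly like 1/sqrt(m), halving each time m grows by 4x.
    for m in (100, 1_000, 10_000, 100_000):
        grads = np.stack([batch_grad(rng.choice(n, size=m, replace=False))
                          for _ in range(50)])
        print(m, grads.std(axis=0).mean())

    # 2) Even the full training set is only an estimate of the true gradient.
    #    For x ~ N(0, I) and y = x^T w_true + noise, the population gradient
    #    at w has the closed form E[2 x (x^T w - y)] = 2 (w - w_true).
    true_grad = 2.0 * (w - w_true)
    full_grad = batch_grad(np.arange(n))  # "mini-batch" = the whole training set
    print(np.linalg.norm(full_grad - true_grad))  # small but not zero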

That concludes this article on choosing the mini-batch size for neural network regression. We hope the recommended answer above is helpful.
