Problem description
I've noticed that a frequent occurrence during training is NANs being introduced.
Often it seems to be introduced by weights in inner-product/fully-connected or convolution layers blowing up.
Is this occurring because the gradient computation is blowing up? Or is it because of weight initialization (if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data?
The overarching question here is simply: What is the most common reason for NANs occurring during training? And secondly, what are some methods for combating this (and why do they work)?
I came across this phenomenon several times. Here are my observations:
Gradient blow up
Reason: large gradients throw the learning process off-track.
What you should expect: Looking at the runtime log, you should look at the loss values per iteration. You'll notice that the loss starts to grow significantly from iteration to iteration; eventually the loss will be too large to be represented by a floating-point variable and it will become nan.
What can you do: Decrease the base_lr (in solver.prototxt) by an order of magnitude (at least). If you have several loss layers, you should inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.
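For example, a minimal sketch of those two knobs (the layer name, blob names, and values here are purely illustrative, not recommendations):

# solver.prototxt: lower the learning rate by an order of magnitude,
# e.g. 0.01 -> 0.001 (illustrative values)
base_lr: 0.001

# train_val.prototxt: down-weight only the loss layer that blows up
layer {
  name: "aux_loss"       # hypothetical layer name
  type: "SoftmaxWithLoss"
  bottom: "fc_aux"       # hypothetical bottom blob
  bottom: "label"
  top: "aux_loss"
  loss_weight: 0.1       # instead of the default 1
}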
Bad learning rate policy and params
Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead; this invalid rate multiplies all updates and thus invalidates all parameters.
What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:
... sgd_solver.cpp:106] Iteration 0, lr = -nan
What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file.
For instance, if you use lr_policy: "poly" and forget to define the max_iter parameter, you'll end up with lr = nan...
For more information about learning rate in caffe, see this thread.
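As a hedged illustration, Caffe's "poly" policy decays the rate as base_lr * (1 - iter/max_iter)^power, so a complete setting might look like this (values are illustrative):

# solver.prototxt (illustrative values)
base_lr: 0.01
lr_policy: "poly"
power: 1.5
max_iter: 100000   # omitting this is what produces lr = nan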
Faulty loss function
Reason: Sometimes the computation of the loss in the loss layers causes nans to appear. For example, feeding an InfogainLoss layer with non-normalized values, using a custom loss layer with bugs, etc.
What you should expect: Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.
What can you do: See if you can reproduce the error, add printouts to the loss layer, and debug the error.
For example: Once I used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all, the computed loss produced nans (normalizing by a zero frequency amounts to dividing by zero). In that case, working with large enough batches (with respect to the number of labels in the set) was enough to avoid this error.
Faulty input
Reason: you have an input with nan in it!
What you should expect: once the learning process "hits" this faulty input, the output becomes nan. Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.
What can you do: re-build your input datasets (lmdb/leveldb/hdf5...) and make sure you do not have bad image files in your training/validation set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it, and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.
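A minimal sketch of such a debug net, assuming an LMDB input and using a "Reduction" layer as the dummy loss (the file name, source path, and layer/blob names are hypothetical):

# debug_input.prototxt (hypothetical file name)
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  data_param {
    source: "path/to/train_lmdb"   # hypothetical path
    backend: LMDB
    batch_size: 1
  }
}
layer {
  name: "dummy_loss"
  type: "Reduction"                # sums all input values
  bottom: "data"
  top: "dummy_loss"
  loss_weight: 1                   # make the sum act as the net's loss
  reduction_param { operation: SUM }
}

If any sample contains a nan, the sum is nan as well, so the reported loss should turn to nan at the iteration where that sample is read.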
stride larger than kernel size in "Pooling" layer
For some reason, choosing stride > kernel_size for pooling may result in nans. For example:
layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}
results in nans in y.
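A variant that keeps the stride no larger than the kernel, which the observation above implies is the safe configuration (values are illustrative):

layer {
  name: "fixed_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 3          # stride <= kernel_size avoids the nans
    kernel_size: 3
  }
}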
Instabilities in "BatchNorm"
It was reported that under some settings the "BatchNorm" layer may output nans due to numerical instabilities.
This issue was raised in bvlc/caffe and PR #5136 is attempting to fix it.
Recently, I became aware of the debug_info flag: setting debug_info: true in 'solver.prototxt' will make caffe print more debug information to the log (including gradient magnitudes and activation values) during training. This information can help in spotting gradient blowups and other problems in the training process.