This article explains how to interpret a Caffe training log produced with debug_info. It may be a useful reference when debugging training problems.

Problem description

When facing difficulties during training (NaNs, loss not converging, etc.), it sometimes helps to get a more verbose training log by setting debug_info: true in the 'solver.prototxt' file.
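If you prefer to toggle this programmatically rather than editing the file by hand, here is a hedged sketch using Caffe's protobuf definitions (assuming pycaffe's generated bindings are importable as caffe.proto.caffe_pb2; the file names are placeholders):

    from caffe.proto import caffe_pb2
    from google.protobuf import text_format

    solver = caffe_pb2.SolverParameter()
    with open('solver.prototxt') as f:          # placeholder path to your solver
        text_format.Merge(f.read(), solver)

    solver.debug_info = True                    # enable per-blob data/diff logging

    with open('solver_debug.prototxt', 'w') as f:
        f.write(text_format.MessageToString(solver))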

The training log then looks something like:

I1109 ...]     [Forward] Layer data, top blob data data: 0.343971
I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0
I1109 ...]     [Forward] Layer relu1, top blob conv1 data: 0.0337982
I1109 ...]     [Forward] Layer conv2, top blob conv2 data: 0.0249297
I1109 ...]     [Forward] Layer conv2, param blob 0 data: 0.00875855
I1109 ...]     [Forward] Layer conv2, param blob 1 data: 0
I1109 ...]     [Forward] Layer relu2, top blob conv2 data: 0.0128249
...
I1109 ...]     [Forward] Layer fc1, top blob fc1 data: 0.00728743
I1109 ...]     [Forward] Layer fc1, param blob 0 data: 0.00876866
I1109 ...]     [Forward] Layer fc1, param blob 1 data: 0
I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506
I1109 ...]     [Backward] Layer fc1, bottom blob conv6 diff: 0.00107067
I1109 ...]     [Backward] Layer fc1, param blob 0 diff: 0.483772
I1109 ...]     [Backward] Layer fc1, param blob 1 diff: 4079.72
...
I1109 ...]     [Backward] Layer conv2, bottom blob conv1 diff: 5.99449e-06
I1109 ...]     [Backward] Layer conv2, param blob 0 diff: 0.00661093
I1109 ...]     [Backward] Layer conv2, param blob 1 diff: 0.10995
I1109 ...]     [Backward] Layer relu1, bottom blob conv1 diff: 2.87345e-06
I1109 ...]     [Backward] Layer conv1, param blob 0 diff: 0.0220984
I1109 ...]     [Backward] Layer conv1, param blob 1 diff: 0.0429201
E1109 ...]     [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)

What does it mean?

Solution

At first glance you can see this log section divided into two parts: [Forward] and [Backward]. Recall that neural network training is done via forward-backward propagation:
A training example (batch) is fed to the net, and a forward pass outputs the current prediction.
Based on this prediction a loss is computed. The loss is then differentiated, and a gradient is estimated and propagated backward using the chain rule.

Caffe Blob data structure
Just a quick recap. Caffe uses the Blob data structure to store data/weights/parameters etc. For this discussion it is important to note that a Blob has two "parts": data and diff. The values of the Blob are stored in the data part. The diff part is used to store element-wise gradients for the backpropagation step.
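As a hedged illustration (assuming pycaffe is built, the placeholder prototxt exists, and the net actually has a layer named "conv1"), both parts of a Blob can be inspected directly from Python:

    import caffe

    # Placeholder file name; substitute your own network definition.
    net = caffe.Net('train_val.prototxt', caffe.TRAIN)

    conv1_out = net.blobs['conv1']        # top blob of layer "conv1"
    filters, bias = net.params['conv1']   # param blob 0 (filters), param blob 1 (bias)

    print(conv1_out.data.shape, conv1_out.diff.shape)  # activations and their gradients
    print(filters.data.shape, filters.diff.shape)      # weights and their gradients

Here net.blobs holds the layers' outputs (the "top" blobs), while net.params holds each layer's learnable parameter blobs; both expose data and diff as NumPy arrays.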

Forward pass

You will see all the layers listed from bottom to top in this part of the log. For each layer you'll see:

I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

Layer "conv1" is a convolution layer that has 2 param blobs: the filters and the bias. Consequently, the log has three lines. The filter blob (param blob 0) has data

 I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114

That is, the current L2 norm of the convolution filter weights is 0.00899.
The current bias (param blob 1):

 I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

meaning that the bias is currently set to 0.

Last but not least, the "conv1" layer has an output, a "top" named "conv1" (how original...). The L2 norm of the output is

 I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037

Note that all L2 values for the [Forward] pass are reported on the data part of the Blobs in question.
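To make the numbers concrete, here is a minimal NumPy-only sketch of computing an L2 magnitude over a hypothetical filter blob's data part. The exact statistic Caffe prints under debug_info may be scaled or normalized differently, so treat this as an illustration of the idea rather than a way to reproduce the logged value bit-for-bit:

    import numpy as np

    # Hypothetical conv1 filter bank: 32 filters of shape 3x5x5.
    weights = np.random.randn(32, 3, 5, 5).astype(np.float32) * 0.01

    l2_magnitude = np.sqrt(np.sum(weights ** 2))   # L2 norm of the data part
    l1_magnitude = np.sum(np.abs(weights))         # L1 norm, as in the summary line
    print(l2_magnitude, l1_magnitude)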

Loss and gradient
At the end of the [Forward] pass comes the loss layer:

I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85

I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506

In this example the batch loss is 2031.85. The gradient of the loss w.r.t. fc1 is computed and stored in the diff part of the fc1 Blob. The L2 magnitude of the gradient is 0.1245.
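A hedged way to reproduce this step interactively from Python (assuming pycaffe is available, the placeholder prototxt defines a data layer that can supply a batch, and your net really has blobs named "loss" and "fc1"):

    import numpy as np
    import caffe

    net = caffe.Net('train_val.prototxt', caffe.TRAIN)  # placeholder prototxt

    net.forward()    # fills the data parts, including the loss blob
    net.backward()   # fills the diff parts with gradients

    print(float(net.blobs['loss'].data))          # batch loss
    print(np.linalg.norm(net.blobs['fc1'].diff))  # gradient magnitude w.r.t. fc1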

Backward pass
All the rest of the layers are listed in this part from top to bottom. You can see that the L2 magnitudes reported now are of the diff part of the Blobs (params and layers' inputs).

Finally
The last log line of this iteration:

[Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)

reports the total L1 and L2 magnitudes of both the data and the gradients.
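For monitoring purposes you can compute comparable totals yourself over all parameter blobs. This is a hedged sketch (pycaffe assumed, placeholder prototxt; Caffe's own aggregation may be scaled differently, so use it to track trends rather than to match the logged numbers exactly):

    import numpy as np
    import caffe

    net = caffe.Net('train_val.prototxt', caffe.TRAIN)  # placeholder prototxt
    net.forward()
    net.backward()

    l1_data = l1_diff = l2_data_sq = l2_diff_sq = 0.0
    for blobs in net.params.values():          # every layer's parameter blobs
        for blob in blobs:
            l1_data += np.abs(blob.data).sum()
            l1_diff += np.abs(blob.diff).sum()
            l2_data_sq += (blob.data ** 2).sum()
            l2_diff_sq += (blob.diff ** 2).sum()

    print('L1 norm = (%g, %g); L2 norm = (%g, %g)'
          % (l1_data, l1_diff, np.sqrt(l2_data_sq), np.sqrt(l2_diff_sq)))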

What should I look for?

  1. If you have NaNs in your loss, see at what point your data or diff turns into NaN: at which layer? at which iteration?

  2. Look at the gradient magnitudes; they should be reasonable. If you start seeing values around e+8, your data/gradients are starting to blow up. Decrease your learning rate!

  3. Check that the diffs are not zero. Zero diffs mean no gradients = no updates = no learning. If you started from random weights, consider generating random weights with higher variance.

  4. Look for activations (rather than gradients) going to zero. If you are using "ReLU", this means your inputs/weights drive the network into regions where the ReLU gates are "not active", leading to "dead neurons". Consider normalizing your inputs to have zero mean, adding "BatchNorm" layers, or setting negative_slope in ReLU. A small sanity-check script covering these points is sketched after this list.
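As a rough starting point for such checks, here is a minimal sketch (pycaffe assumed; the prototxt path, the 90% zero threshold and the 1e8 gradient threshold are arbitrary illustrative choices):

    import numpy as np
    import caffe

    net = caffe.Net('train_val.prototxt', caffe.TRAIN)  # placeholder prototxt
    net.forward()
    net.backward()

    for name, blob in net.blobs.items():
        if np.isnan(blob.data).any() or np.isnan(blob.diff).any():
            print('NaN detected in blob', name)            # point 1: where do NaNs appear?
        dead = float(np.mean(blob.data == 0))
        if dead > 0.9:                                      # arbitrary threshold
            print('blob %s is %.0f%% zeros (possible dead ReLUs)' % (name, 100 * dead))

    for name, blobs in net.params.items():
        for i, blob in enumerate(blobs):
            g = np.linalg.norm(blob.diff)
            if g == 0:
                print('param blob %d of layer %s has zero gradient' % (i, name))           # point 3
            elif g > 1e8:
                print('param blob %d of layer %s has exploding gradient: %g' % (i, name, g))  # point 2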


This concludes the article on how to interpret a Caffe log with debug_info. We hope the answer above is helpful.
