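This post covers the softmax operation, the softmax loss function, and the gradient computation for their backpropagation. It follows on from two earlier posts: Loss Functions and Deriving the Backpropagation Formulas by Hand.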
Softmax gradient
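Suppose there are K classes, so the target label y is a one-hot vector of the form \([0,0,...0,1,0...0]^T\). The softmax layer outputs \([a_1,a_2,...,a_j,...a_K]^T\). The softmax output for class j and its partial derivatives with respect to the input z are given below, where \(\Sigma\) denotes the denominator \(\sum_{k=1}^K \exp(z_{k})\):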
\[\begin{align}
a_{j} &= \frac{\exp(z_{j})}{\sum_{k=1}^K \exp(z_{k})} \forall j\in 1...K \\
{\partial a_{j}\over \partial z_{j} } &= {\exp(z_{j})\cdot(\Sigma - \exp(z_{j}) )\over \Sigma^2} = a_j(1 - a_j) \\
{\partial a_{k}\over \partial z_{j} } &= { - \exp(z_{k}) \cdot \exp(z_{j}) \over \Sigma^2} = -a_j a_k \tag{$k\ne j$}
\end{align}
\]
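As a small illustration of the forward formula, here is a minimal sketch in plain Python (standard library only). The function name `softmax` and the subtraction of `max(z)` are my own choices rather than anything from the derivation above; the shift is a common trick that leaves the result unchanged while keeping `exp` from overflowing.

```python
import math

def softmax(z):
    """Return a_j = exp(z_j) / sum_k exp(z_k) for a list of scores z."""
    # Assumption: shift by max(z) for numerical stability.
    # The shift cancels between numerator and denominator.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)  # plays the role of Sigma in the formulas
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
a = softmax(z)
print(a)       # three probabilities, largest for the largest score
print(sum(a))  # sums to 1 up to rounding
```

The outputs are positive and sum to 1, which is what lets them be read as class probabilities.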
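For a fully connected DNN we have: \(z_{j}^{l+1}=\sum_i w_{ij}^{l+1} a_{i}^{l}+b_j^{l+1}\)
\(a_j^{l+1}\) can be interpreted as the probability that the observed data \(a^l\) belongs to class j, also called the likelihood.
The gradient of the output with respect to the input, \(\partial a\over \partial z\), is: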
\[\begin{align}
{\partial a\over \partial z_k}=
\begin{bmatrix}
{\partial a_1\over \partial z_k} \\
\vdots \\
{\partial a_k\over \partial z_k} \\
\vdots \\
{\partial a_K\over \partial z_k}
\end{bmatrix}
=
\begin{bmatrix}
-a_1 \\
\vdots \\
(1-a_k) \\
\vdots \\
-a_K
\end{bmatrix}a_k
=
(\begin{bmatrix}
0 \\
\vdots \\
1 \\
\vdots \\
0
\end{bmatrix}
-a)a_k
\end{align}
\]
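The column formula above can be checked numerically. The sketch below builds the full Jacobian from \((e_k - a)a_k\) and compares it with central finite differences. The helper names (`softmax_jacobian`, `numeric_jacobian`) and the test vector are hypothetical and used only for illustration.

```python
import math

def softmax(z):
    m = max(z)  # shift for numerical stability; does not change the result
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(a):
    """J[j][k] = d a_j / d z_k = a_k * (delta_jk - a_j)."""
    K = len(a)
    return [[a[k] * ((1.0 if j == k else 0.0) - a[j]) for k in range(K)]
            for j in range(K)]

def numeric_jacobian(z, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    K = len(z)
    J = [[0.0] * K for _ in range(K)]
    for k in range(K):
        zp = list(z)
        zp[k] += eps
        zm = list(z)
        zm[k] -= eps
        ap, am = softmax(zp), softmax(zm)
        for j in range(K):
            J[j][k] = (ap[j] - am[j]) / (2 * eps)
    return J

z = [0.5, -1.0, 2.0]
J = softmax_jacobian(softmax(z))
J_num = numeric_jacobian(z)
max_err = max(abs(J[j][k] - J_num[j][k]) for j in range(3) for k in range(3))
print(max_err)  # should be tiny: only finite-difference error remains
```

Each column of the Jacobian sums to zero, because the outputs always sum to 1 no matter how a single input changes.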
The gradient of the loss with respect to the input, \({\partial E\over \partial z}\), is therefore:
\[\begin{align}
{\partial E\over \partial z_k} &= {\partial E\over \partial a}{\partial a\over \partial z_k}=({\partial E\over \partial a_k} - [{\partial E\over \partial a}]^T a)a_k \\
{\partial E\over \partial z} &= {\partial E\over \partial a}{\partial a\over \partial z}=({\partial E\over \partial a} - [{\partial E\over \partial a}]^T a)\odot a
\end{align}
\]
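The corresponding gradient backpropagation in Caffe's SoftmaxLayer is implemented as:
# dot denotes matrix multiplication (here an inner product), * denotes element-wise multiplication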
bottom_diff = (top_diff - dot(top_diff, top_data)) * top_data
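The one-line formula can be spelled out for a single sample as follows. This is only a sketch in plain Python, not Caffe's actual C++ code. The variable names follow the pseudocode above, while the two function names and the sample values are hypothetical. The second function multiplies by the explicit Jacobian so the two results can be compared.

```python
def softmax_backward(top_diff, top_data):
    """bottom_diff = (top_diff - dot(top_diff, top_data)) * top_data."""
    inner = sum(g * a for g, a in zip(top_diff, top_data))  # dot(top_diff, top_data)
    return [(g - inner) * a for g, a in zip(top_diff, top_data)]

def softmax_backward_via_jacobian(top_diff, top_data):
    """Reference: dE/dz_k = sum_j dE/da_j * a_k * (delta_jk - a_j)."""
    K = len(top_data)
    return [sum(top_diff[j] * top_data[k] * ((1.0 if j == k else 0.0) - top_data[j])
                for j in range(K))
            for k in range(K)]

top_data = [0.2, 0.3, 0.5]   # a softmax output (sums to 1)
top_diff = [1.0, -2.0, 0.5]  # an arbitrary upstream gradient dE/da
fast = softmax_backward(top_diff, top_data)
slow = softmax_backward_via_jacobian(top_diff, top_data)
print(fast)
print(slow)  # agrees with fast up to rounding
```

The vectorised form needs one inner product and one element-wise product, so it avoids building the K-by-K Jacobian.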
Softmax loss gradient
The loss function for a single sample is:
\[E = -\sum^K_{k}y_k\log(a_{k}) \\
{\partial E\over \partial a_j} = -\sum^K_{k}{y_k\over a_k}\cdot {\partial a_k\over \partial a_j}=-{y_j\over a_j}
\]
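A minimal sketch of the loss and its gradient with respect to the softmax output, in plain Python. The function names and the sample values are hypothetical. Terms with \(y_k = 0\) are skipped in the sum because they contribute nothing.

```python
import math

def cross_entropy(y, a):
    """E = -sum_k y_k * log(a_k)."""
    return -sum(yk * math.log(ak) for yk, ak in zip(y, a) if yk != 0)

def cross_entropy_grad(y, a):
    """dE/da_j = -y_j / a_j."""
    return [-yj / aj for yj, aj in zip(y, a)]

y = [0.0, 1.0, 0.0]  # one-hot label, true class is index 1
a = [0.2, 0.5, 0.3]  # a softmax output
print(cross_entropy(y, a))       # equals -log(0.5)
print(cross_entropy_grad(y, a))  # nonzero only at the true class
```

With a one-hot label the loss reduces to the negative log probability of the true class, and the gradient is nonzero only at that index.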
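Next we compute the gradient of E with respect to w and b. The procedure is the same as in the general backpropagation gradient formulas, except that the activation function (softmax) and the loss function are now fixed. The last simplification step uses \(\sum_k y_k = 1\), which holds because y is one-hot: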
\[\begin{align}
{\partial E\over \partial b_j^{l+1}} &= {\partial E\over \partial z_j^{l+1}} = \sum_k{\partial E\over \partial a_k^{l+1}} \cdot {\partial a_k^{l+1}\over \partial z_j^{l+1}} \\
&=-{y_j^{l+1}\over a_j^{l+1}} \cdot a_j^{l+1}(1 - a_j^{l+1})+\sum_{k\ne j}[-{y_k^{l+1}\over a_k^{l+1}} \cdot -a_j^{l+1} a_k^{l+1}] \\
&= -y_j^{l+1}+y_j^{l+1} a_j^{l+1} +\sum_{k\ne j}y_k^{l+1}a_j^{l+1} \\
&= a_j^{l+1}-y_j^{l+1} \\
{\partial E\over \partial w_{ij}^{l+1}} &= {\partial E\over \partial z_j^{l+1}} \cdot {\partial z_j^{l+1}\over \partial w_{ij}^{l+1}}=(a_j^{l+1}-y_j^{l+1})a_i^l
\end{align}
\]
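The result \({\partial E\over \partial z_j} = a_j - y_j\) is easy to sanity-check with finite differences. The sketch below assumes a single sample and uses made-up scores. The helper names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)  # shift for numerical stability; does not change the result
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """Cross-entropy of softmax(z) against a one-hot label y."""
    a = softmax(z)
    return -sum(yk * math.log(ak) for yk, ak in zip(y, a) if yk != 0)

z = [0.3, -1.2, 2.0, 0.1]
y = [0.0, 0.0, 1.0, 0.0]

# Analytic gradient from the derivation: dE/dz = a - y
analytic = [aj - yj for aj, yj in zip(softmax(z), y)]

# Central finite differences on the composed loss
eps = 1e-6
numeric = []
for j in range(len(z)):
    zp = list(z)
    zp[j] += eps
    zm = list(z)
    zm[j] -= eps
    numeric.append((loss(zp, y) - loss(zm, y)) / (2 * eps))

max_err = max(abs(p - q) for p, q in zip(analytic, numeric))
print(max_err)  # should be tiny: only finite-difference error remains
```

The components of `analytic` sum to zero, since both a and y sum to 1.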
The corresponding gradient backpropagation in Caffe's SoftmaxWithLossLayer computes \({\partial E\over \partial z}\) as:
# prob_data is the softmax result from the forward pass, label_data is the one-hot representation of the label
bottom_diff = prob_data - label_data
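To connect this with the formulas for w and b, here is a sketch of the parameter gradients of a fully connected layer feeding the softmax loss, for a single sample. The function name `softmax_loss_param_grads`, the argument `a_prev` (standing for \(a^l\)) and the numbers are hypothetical. `prob_data` and `label_data` follow the pseudocode above.

```python
def softmax_loss_param_grads(a_prev, prob_data, label_data):
    """Return (dW, db) with dW[i][j] = (a_j - y_j) * a_i and db[j] = a_j - y_j."""
    delta = [p - t for p, t in zip(prob_data, label_data)]  # dE/dz = a - y
    dW = [[ai * dj for dj in delta] for ai in a_prev]       # outer product of a_prev and delta
    db = list(delta)
    return dW, db

a_prev = [1.0, 2.0]             # activations a^l of the previous layer (2 units)
prob_data = [0.25, 0.25, 0.5]   # softmax output a^{l+1} (3 classes)
label_data = [0.0, 0.0, 1.0]    # one-hot label, true class is index 2
dW, db = softmax_loss_param_grads(a_prev, prob_data, label_data)
print(db)  # [0.25, 0.25, -0.5]
print(dW)  # [[0.25, 0.25, -0.5], [0.5, 0.5, -1.0]]
```

The gradient for the true class is negative, so a gradient-descent step raises its score, while the scores of the other classes are lowered.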
References