This is the first chapter of the series on gradient derivation and code verification for neural networks. For more, see the introduction to the series 《神经网络的梯度推导与代码验证》 (Gradient Derivation and Code Verification for Neural Networks).
1.1 Mathematical notation
The notation used throughout this series is as follows:
- Scalars are written as lowercase letters, e.g. $x, y$.
- Vectors are written as bold lowercase letters, e.g. $\boldsymbol{x}, \boldsymbol{y}$. Each element of a vector is a scalar and is written as a subscripted lowercase letter, e.g. $x_{1}, x_{2}$. Note that, by mathematical convention, a vector is a column vector by default; a row vector is written with a transpose, e.g. $\boldsymbol{x}^{T}$.
- Matrices are written as bold uppercase letters, e.g. $\boldsymbol{A}, \boldsymbol{B}$, and their elements are written as $a_{ij}, b_{ij}$.
1.2 Definition and layout of matrix derivatives
Depending on whether the quantity being differentiated and the variable we differentiate with respect to is a scalar, a vector or a matrix, there are 3 × 3 = 9 possible kinds of matrix derivative, one for each combination of {scalar, vector, matrix} in the numerator with {scalar, vector, matrix} in the denominator.
These 9 kinds are introduced below roughly in order of difficulty, but not all of them are covered: some never appear in the derivations of this series, and I have not studied them closely either.
-------------Easy--------------
- Scalar-by-scalar derivatives: nothing special to say, skipping...
- Vector/matrix-by-scalar derivatives: defined by differentiating every element of the vector/matrix with respect to that scalar.
- Example:
$\boldsymbol{Y}=\begin{bmatrix}x& 2x\\ x^{2} & 2\end{bmatrix}$ differentiated with respect to $x$ gives $\frac{\partial \boldsymbol{Y}}{\partial x} = \left\lbrack \begin{array}{ll} \frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} \\ \frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{ll} 1 & 2 \\ {2x} & 0 \\ \end{array} \right\rbrack$
Vector-by-scalar derivatives work the same way:
$\boldsymbol{y} = \left\lbrack \begin{array}{l} x \\ {2x} \\ x^{2} \\ \end{array} \right\rbrack$ differentiated with respect to $x$ gives $\frac{\partial\boldsymbol{y}}{\partial x} = \left\lbrack \begin{array}{l} \frac{\partial y_{1}}{\partial x} \\ \frac{\partial y_{2}}{\partial x} \\ \frac{\partial y_{3}}{\partial x} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{l} 1 \\ 2 \\ {2x} \\ \end{array} \right\rbrack$
$\boldsymbol{y}^{T} = \left\lbrack {x,~2x,~x^{2}} \right\rbrack$ differentiated with respect to $x$ gives $\frac{\partial\boldsymbol{y}^{T}}{\partial x} = \left\lbrack {\frac{\partial y_{1}}{\partial x},~\frac{\partial y_{2}}{\partial x},~\frac{\partial y_{3}}{\partial x}} \right\rbrack = \left\lbrack {1,~2,~2x} \right\rbrack$
In short, the result has the same shape as the dependent variable (the numerator); this is the so-called numerator layout.
- Scalar-by-vector/matrix derivatives: defined by differentiating the scalar with respect to every element of the vector/matrix.
- Example:
$y = x_{1} + {2x}_{2} + x_{3}^{2} + 1$ differentiated with respect to $\boldsymbol{x} = \left\lbrack \begin{array}{l} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \\ \end{array} \right\rbrack$ gives $\frac{\partial y}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{l} \frac{\partial y}{\partial x_{1}} \\ \frac{\partial y}{\partial x_{2}} \\ \frac{\partial y}{\partial x_{3}} \\ \frac{\partial y}{\partial x_{4}} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{l} 1 \\ 2 \\ {2x}_{3} \\ 0 \\ \end{array} \right\rbrack$
$y = x_{1} + {2x}_{2} + x_{3}^{2} + 1$ differentiated with respect to $\boldsymbol{x}^{T} = \left\lbrack {x_{1},x_{2},x_{3},x_{4}} \right\rbrack$ gives $\frac{\partial y}{\partial\boldsymbol{x}^{T}} = \left\lbrack {\frac{\partial y}{\partial x_{1}},\frac{\partial y}{\partial x_{2}},\frac{\partial y}{\partial x_{3}},\frac{\partial y}{\partial x_{4}}} \right\rbrack = \left\lbrack {1,2,{2x}_{3},0} \right\rbrack$
$y = x_{1} + {2x}_{2} + x_{3}^{2} + 1$ differentiated with respect to the matrix $\boldsymbol{X} = \left\lbrack \begin{array}{ll} x_{1} & x_{2} \\ x_{3} & x_{4} \\ \end{array} \right\rbrack$ gives $\frac{\partial y}{\partial\boldsymbol{X}} = \left\lbrack \begin{array}{ll} \frac{\partial y}{\partial x_{1}} & \frac{\partial y}{\partial x_{2}} \\ \frac{\partial y}{\partial x_{3}} & \frac{\partial y}{\partial x_{4}} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{ll} 1 & 2 \\ {2x}_{3} & 0 \\ \end{array} \right\rbrack$
In short, the result has the same shape as the independent variable (the denominator); this is the so-called denominator layout. In the end, so-called matrix differentiation is nothing more than taking scalar derivatives element by element and arranging the results into a vector/matrix. A minimal numerical check of this view is sketched below.
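To make the "differentiate element by element, then arrange" view concrete, here is a minimal NumPy sketch (the `numerical_grad` helper and the chosen test point are illustrative assumptions, not part of the original text) that compares the hand-derived gradient of $y = x_{1} + 2x_{2} + x_{3}^{2} + 1$ with a central-difference estimate:

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

f = lambda x: x[0] + 2 * x[1] + x[2] ** 2 + 1      # y = x1 + 2*x2 + x3^2 + 1
x = np.array([0.3, -1.2, 0.7, 2.0])

analytic = np.array([1.0, 2.0, 2 * x[2], 0.0])     # the gradient derived above
print(np.allclose(numerical_grad(f, x), analytic)) # expect True
```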
--------------Slightly harder------------
- Next comes the slightly more involved vector-by-vector derivative. In all the derivatives above, either the independent or the dependent variable was a scalar, so both the computation and the layout of the result were fairly obvious. Vector-by-vector derivatives are generally defined as follows:
Let $\boldsymbol{y} = \left\lbrack \begin{array}{l} y_{1} \\ y_{2} \\ y_{3} \\ \end{array} \right\rbrack$ and $\boldsymbol{x} = \left\lbrack \begin{array}{l} x_{1} \\ x_{2} \\ \end{array} \right\rbrack$; then $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll} \frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{1}}{\partial x_{2}} \\ \frac{\partial y_{2}}{\partial x_{1}} & \frac{\partial y_{2}}{\partial x_{2}} \\ \frac{\partial y_{3}}{\partial x_{1}} & \frac{\partial y_{3}}{\partial x_{2}} \\ \end{array} \right\rbrack$
The matrix obtained this way is called the Jacobian matrix (important). Its first dimension (rows) follows the numerator and its second dimension (columns) follows the denominator; intuitively, each element of the numerator is expanded horizontally into a row of partial derivatives over the denominator.
- Example (a numerical check follows right after):
$\boldsymbol{y} = \left\lbrack \begin{array}{l} {x_{1} + x_{2}} \\ x_{1} \\ {x_{1} + x_{2}^{2}} \\ \end{array} \right\rbrack$ differentiated with respect to $\boldsymbol{x} = \left\lbrack \begin{array}{l} x_{1} \\ x_{2} \\ \end{array} \right\rbrack$ gives $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll} \frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{1}}{\partial x_{2}} \\ \frac{\partial y_{2}}{\partial x_{1}} & \frac{\partial y_{2}}{\partial x_{2}} \\ \frac{\partial y_{3}}{\partial x_{1}} & \frac{\partial y_{3}}{\partial x_{2}} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{ll} 1 & 1 \\ 1 & 0 \\ 1 & {2x_{2}} \\ \end{array} \right\rbrack$
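As a quick sanity check of the Jacobian above, here is a small NumPy sketch (the `numerical_jacobian` helper and the test point are illustrative assumptions) that builds the Jacobian column by column from central differences and compares it with the hand-derived matrix:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian: row i follows y_i, column j follows x_j."""
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

f = lambda x: np.array([x[0] + x[1], x[0], x[0] + x[1] ** 2])
x = np.array([0.5, -0.8])

analytic = np.array([[1.0, 1.0],
                     [1.0, 0.0],
                     [1.0, 2 * x[1]]])
print(np.allclose(numerical_jacobian(f, x), analytic))   # expect True
```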
1.3 Why matrix differentiation is worth the trouble
The point of matrix calculus is not sophistication for its own sake; it is to make mistakes much less likely when dealing with the enormous number of parameters in a neural network.
An example:
$\boldsymbol{A} = \left( \begin{array}{ll} 1 & 2 \\ 3 & 4 \\ \end{array} \right)$, $\boldsymbol{x} = \left\lbrack \begin{array}{l} x_{1} \\ x_{2} \\ \end{array} \right\rbrack$; find the derivative of $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x}$ with respect to $\boldsymbol{x}$.
Working from the definition, we first compute the vector $\boldsymbol{y} = \left\lbrack \begin{array}{l} {x_{1} + 2x_{2}} \\ {{3x}_{1} + {4x}_{2}} \\ \end{array} \right\rbrack$ and then, using the vector-by-vector definition from Section 1.2, compute $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll} \frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{1}}{\partial x_{2}} \\ \frac{\partial y_{2}}{\partial x_{1}} & \frac{\partial y_{2}}{\partial x_{2}} \\ \end{array} \right\rbrack$, which gives $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll} 1 & 2 \\ 3 & 4 \\ \end{array} \right\rbrack = \boldsymbol{A}$.
Another example:
$\boldsymbol{A} = \left\lbrack \begin{array}{ll} 1 & 2 \\ 3 & 4 \\ \end{array} \right\rbrack$, $\boldsymbol{x} = \left\lbrack \begin{array}{l} x_{1} \\ x_{2} \\ \end{array} \right\rbrack$; find the derivative of $y = \boldsymbol{x}^{T}\boldsymbol{A}\boldsymbol{x}$ with respect to $\boldsymbol{x}$.
Working from the definition, we first compute the scalar $y = x_{1}^{2} + 5x_{1}x_{2} + {4x}_{2}^{2}$ and then, using the scalar-by-vector definition from Section 1.2, compute $\frac{\partial y}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{l} \frac{\partial y}{\partial x_{1}} \\ \frac{\partial y}{\partial x_{2}} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{l} {2x_{1} + 5x_{2}} \\ {8x_{2} + 5x_{1}} \\ \end{array} \right\rbrack$
In fact, $\left\lbrack \begin{array}{l} {2x_{1} + 5x_{2}} \\ {8x_{2} + 5x_{1}} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{l} {x_{1} + 2x_{2}} \\ {3x_{1} + 4x_{2}} \\ \end{array} \right\rbrack + \left\lbrack \begin{array}{l} {x_{1} + 3x_{2}} \\ {2x_{1} + 4x_{2}} \\ \end{array} \right\rbrack = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{A}^{T}\boldsymbol{x}$; the second equality is no coincidence and can be obtained directly by matrix differentiation (how to do so is explained later).
For the first example we could perhaps still write the answer quickly and correctly from the definition, but for the second, starting from scalar-by-scalar derivatives, expanding $y$ and then differentiating with respect to $\boldsymbol{x}$ is already tedious and error-prone. Working at the level of vectors and matrices instead, the result can be written directly as a combination of vectors and matrices, which is both efficient and compact. A quick numerical check of both results is sketched below.
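Both claims above are easy to check numerically. The sketch below (the chosen test point is an arbitrary assumption) verifies that the Jacobian of $\boldsymbol{y}=\boldsymbol{A}\boldsymbol{x}$ equals $\boldsymbol{A}$ and that the gradient of $y=\boldsymbol{x}^{T}\boldsymbol{A}\boldsymbol{x}$ equals $\boldsymbol{A}^{T}\boldsymbol{x}+\boldsymbol{A}\boldsymbol{x}$:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([0.7, -1.3])
eps = 1e-6

# Example 1: the Jacobian of y = A x should equal A.
J = np.zeros((2, 2))
for j in range(2):
    e = np.zeros(2); e[j] = eps
    J[:, j] = (A @ (x + e) - A @ (x - e)) / (2 * eps)
print(np.allclose(J, A))                      # expect True

# Example 2: the gradient of y = x^T A x should equal A^T x + A x.
f = lambda x: x @ A @ x
g = np.zeros(2)
for j in range(2):
    e = np.zeros(2); e[j] = eps
    g[j] = (f(x + e) - f(x - e)) / (2 * eps)
print(np.allclose(g, A.T @ x + A @ x))        # expect True
```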
1.4 Matrix differentials and matrix derivatives
In high school we learned that the differential of a single-variable function $f\left( x \right)$ is related to its derivative as follows:
$df = f^{'}\left( x \right)dx$
In college calculus we further learned the relation between a multivariate function $f\left( x_{1},x_{2},x_{3} \right)$ and its partial derivatives:
$df = \frac{\partial f}{\partial x_{1}}dx_{1} + \frac{\partial f}{\partial x_{2}}dx_{2} + \frac{\partial f}{\partial x_{3}}dx_{3}$ (1.1)
This is the total differential formula (remember it?).
Looking at it, we can see that:
$df = {\sum\limits_{i = 1}^{n}\frac{\partial f}{\partial x_{i}}}dx_{i} = \left( \frac{\partial f}{\partial\boldsymbol{x}} \right)^{T}d\boldsymbol{x}$ (1.2)
The first equality is the total differential formula; the second expresses the connection between the gradient (the partial derivatives) and the differential: formally it is the inner product of $\frac{\partial f}{\partial\boldsymbol{x}}$ and $d\boldsymbol{x}$.
This suggests the generalization to matrices:
$~df = {\sum\limits_{i = 1}^{m}{\sum\limits_{j = 1}^{n}\frac{\partial f}{\partial x_{ij}}}}dx_{ij} = tr\left( {\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X}} \right)$ (1.3)
Here the second equality uses a property of the matrix trace (recall that the trace is the sum of the elements on the main diagonal), namely:
$tr\left( {A^{T}B} \right) = {\sum\limits_{i,j}{a_{ij}b_{ij}}}$
The left-hand side may look unpleasant, but the meaning of the right-hand side is clear: multiply the two matrices element by element and sum everything up, analogous to the inner product of two vectors; this is the matrix inner product.
An example:
Let $f\left( x_{11},x_{12},x_{21},x_{22} \right)$ be a multivariate function. By the total differential formula, $df = {\sum\limits_{i = 1}^{m}{\sum\limits_{j = 1}^{n}\frac{\partial f}{\partial x_{ij}}}}dx_{ij}$ holds. Now arrange these 4 variables into a matrix $\boldsymbol{X} = \left\lbrack \begin{array}{ll} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \end{array} \right\rbrack$; then, by the definition of the scalar-by-matrix derivative given earlier, we have:
$\frac{\partial f}{\partial\boldsymbol{X}} = \left\lbrack \begin{array}{ll} \frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{12}} \\ \frac{\partial f}{\partial x_{21}} & \frac{\partial f}{\partial x_{22}} \\ \end{array} \right\rbrack$
$\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X} = \left\lbrack \begin{array}{ll} \frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{21}} \\ \frac{\partial f}{\partial x_{12}} & \frac{\partial f}{\partial x_{22}} \\ \end{array} \right\rbrack\left\lbrack \begin{array}{ll} {dx}_{11} & {dx_{12}} \\ {dx}_{21} & {dx}_{22} \\ \end{array} \right\rbrack = \left\lbrack \begin{array}{ll} {\frac{\partial f}{\partial x_{11}}{dx}_{11} + \frac{\partial f}{\partial x_{21}}{dx}_{21}} & {\frac{\partial f}{\partial x_{11}}dx_{12} + \frac{\partial f}{\partial x_{21}}{dx}_{22}} \\ {\frac{\partial f}{\partial x_{12}}{dx}_{11} + \frac{\partial f}{\partial x_{22}}{dx}_{21}} & {\frac{\partial f}{\partial x_{12}}{dx}_{12} + \frac{\partial f}{\partial x_{22}}{dx}_{22}} \\ \end{array} \right\rbrack$
Since $tr\left( \cdot \right)$ sums the diagonal elements of a matrix, we get
$df = \frac{\partial f}{\partial x_{11}}{dx}_{11} + \frac{\partial f}{\partial x_{21}}{dx}_{21} + \frac{\partial f}{\partial x_{12}}{dx}_{12} + \frac{\partial f}{\partial x_{22}}{dx}_{22} = tr\left( {\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X}} \right)$, as claimed. A small numerical check of this identity is sketched below.
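If the trace form feels abstract, it can be confirmed numerically. The sketch below (the shapes and the test function $f(\boldsymbol{X})=\sum_{i,j}x_{ij}^{2}$, whose gradient is $2\boldsymbol{X}$, are assumptions made purely for illustration) checks both the matrix inner product identity $tr(\boldsymbol{A}^{T}\boldsymbol{B})=\sum_{i,j}a_{ij}b_{ij}$ and the relation $df \approx tr\left( \left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X} \right)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) tr(A^T B) equals the element-wise inner product sum(A * B).
A, B = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(np.allclose(np.trace(A.T @ B), np.sum(A * B)))          # expect True

# 2) df ≈ tr((df/dX)^T dX) for f(X) = sum(X**2), whose gradient is 2X.
f = lambda X: np.sum(X ** 2)
X = rng.normal(size=(2, 2))
dX = 1e-6 * rng.normal(size=(2, 2))                            # a tiny perturbation
df = f(X + dX) - f(X)
print(np.allclose(df, np.trace((2 * X).T @ dX), atol=1e-10))   # expect True
```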
1.5 Properties of the matrix differential
Before discussing how to use matrix differentials to compute derivatives, let us list the properties of the matrix differential; they are all fairly intuitive:
- Sum/difference: $d\left( {\boldsymbol{X} \pm \boldsymbol{Y}} \right) = d\boldsymbol{X} \pm d\boldsymbol{Y}$
- Product: $d\left( \boldsymbol{X}\boldsymbol{Y} \right) = \left( d\boldsymbol{X} \right)\boldsymbol{Y} + \boldsymbol{X}\left( d\boldsymbol{Y} \right)$
- Transpose: $d\left( \boldsymbol{X}^{T} \right) = \left( d\boldsymbol{X} \right)^{T}$
- Trace: $d\,tr\left( \boldsymbol{X} \right) = tr\left( d\boldsymbol{X} \right)$
- Hadamard (element-wise) product: $d\left( {\boldsymbol{X}\odot\boldsymbol{Y}} \right) = \boldsymbol{X}\odot d\boldsymbol{Y} + d\boldsymbol{X}\odot\boldsymbol{Y}$; the Hadamard product has lower precedence than ordinary matrix multiplication.
- Element-wise function: $d\sigma\left( \boldsymbol{X} \right) = \sigma^{'}\left( \boldsymbol{X} \right)\odot d\boldsymbol{X}$
- Inverse: $d\boldsymbol{X}^{- 1} = - \boldsymbol{X}^{- 1}d\boldsymbol{X}\boldsymbol{X}^{- 1}$
- Determinant (not used in this series): $d\left| \boldsymbol{X} \right| = \left| \boldsymbol{X} \right|tr\left( \boldsymbol{X}^{- 1}d\boldsymbol{X} \right)$
Here $\sigma\left( \boldsymbol{X} \right)$ means applying the function $\sigma$ to every element of $\boldsymbol{X}$, i.e. $\sigma\left( \boldsymbol{X} \right) = \left\lbrack \begin{array}{lll} {\sigma\left( x_{11} \right)} & \cdots & {\sigma\left( x_{1n} \right)} \\ \vdots & \ddots & \vdots \\ {\sigma\left( x_{m1} \right)} & \cdots & {\sigma\left( x_{mn} \right)} \\ \end{array} \right\rbrack$, which is exactly what happens when data passes through an activation function in a neural network.
For example, with $\boldsymbol{X} = \left\lbrack \begin{array}{ll} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \end{array} \right\rbrack$, $d\sin\left( \boldsymbol{X} \right) = \left\lbrack \begin{array}{ll} {\cos x_{11}\,dx_{11}} & {\cos x_{12}\,dx_{12}} \\ {\cos x_{21}\,dx_{21}} & {\cos x_{22}\,dx_{22}} \\ \end{array} \right\rbrack = \cos\left( \boldsymbol{X} \right)\odot d\boldsymbol{X}$
If you are unsure about any of the other properties, you can verify them yourself with small examples; a throwaway numerical check of the element-wise rule is sketched below.
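For instance, the element-wise rule $d\sin(\boldsymbol{X}) = \cos(\boldsymbol{X})\odot d\boldsymbol{X}$ can be spot-checked like this (a throwaway NumPy sketch with an arbitrary random point and perturbation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 2))
dX = 1e-7 * rng.normal(size=(2, 2))          # a tiny perturbation

lhs = np.sin(X + dX) - np.sin(X)             # d(sin(X)), element-wise
rhs = np.cos(X) * dX                         # cos(X) ⊙ dX
print(np.allclose(lhs, rhs, atol=1e-12))     # expect True (to first order)
```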
1.6 A recipe for scalar-by-matrix/vector derivatives: the trace trick
We want to use the connection between the differential and the derivative of a scalar (the loss) with respect to a vector (the output of some layer) or a matrix (the parameters of some layer), i.e. formulas 1.2 and 1.3, to compute scalar-by-vector/matrix derivatives. If the differential of a scalar can be written in one of those forms, the derivative is exactly the factor under the transpose on the right-hand side, i.e. the $\frac{\partial f}{\partial\boldsymbol{x}}$ in $df = {\sum\limits_{i = 1}^{n}\frac{\partial f}{\partial x_{i}}}dx_{i} = \left( \frac{\partial f}{\partial\boldsymbol{x}} \right)^{T}d\boldsymbol{x}$ and the $\frac{\partial f}{\partial\boldsymbol{X}}$ in $df = {\sum\limits_{i = 1}^{m}{\sum\limits_{j = 1}^{n}\frac{\partial f}{\partial x_{ij}}}}dx_{ij} = tr\left( {\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X}} \right)$.
Before the worked examples there is one more indispensable tool, extremely useful for scalar-by-matrix/vector derivatives: the trace trick. The small demo later will make clear why it is necessary. Here are some properties of the trace (all of them useful):
- The trace of a scalar is the scalar itself: $tr\left( x \right) = x$
- Invariance under transposition: $tr\left( \boldsymbol{A}^{T} \right) = tr\left( \boldsymbol{A} \right)$
- Cyclic property: $tr\left( {\boldsymbol{A}\boldsymbol{B}} \right) = tr\left( {\boldsymbol{B}\boldsymbol{A}} \right)$, where $\boldsymbol{A}$ and $\boldsymbol{B}^{T}$ have the same shape (this is necessary, otherwise the dimensions are not compatible). Both sides equal $\sum\limits_{i,j}{\boldsymbol{A}_{ij}\boldsymbol{B}_{ji}}$
- Sum/difference: $tr\left( {\boldsymbol{A} \pm \boldsymbol{B}} \right) = tr\left( \boldsymbol{A} \right) \pm tr\left( \boldsymbol{B} \right)$
- Exchanging matrix multiplication and the Hadamard product inside a trace: $tr\left( {\left( {\boldsymbol{A}\odot\boldsymbol{B}} \right)^{T}\boldsymbol{C}} \right) = tr\left( {\boldsymbol{A}^{T}\left( {\boldsymbol{B}\odot\boldsymbol{C}} \right)} \right)$, where $\boldsymbol{A}$, $\boldsymbol{B}$ and $\boldsymbol{C}$ all have the same shape. Both sides equal ${\sum\limits_{i,j}{\boldsymbol{A}_{ij}\boldsymbol{B}_{ij}}}\boldsymbol{C}_{ij}$. (All of these can be spot-checked numerically, e.g. with the short sketch after this list.)
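Here is the promised sketch: it draws random matrices (sizes chosen arbitrarily for illustration) and confirms the cyclic property and the Hadamard/trace exchange numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))
C = rng.normal(size=(3, 4))
D = rng.normal(size=(3, 4))

# Cyclic property: tr(AB) = tr(BA); both equal sum_{i,j} A_ij * B_ji.
print(np.allclose(np.trace(A @ B), np.trace(B @ A)))                  # expect True
print(np.allclose(np.trace(A @ B), np.sum(A * B.T)))                  # expect True

# Hadamard/trace exchange: tr((A ⊙ D)^T C) = tr(A^T (D ⊙ C)) = sum_{i,j} A_ij D_ij C_ij.
print(np.allclose(np.trace((A * D).T @ C), np.trace(A.T @ (D * C))))  # expect True
print(np.allclose(np.trace((A * D).T @ C), np.sum(A * D * C)))        # expect True
```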
To summarize the recipe for scalar-by-matrix/vector derivatives: if the scalar function $f$ is built from the matrix $\boldsymbol{X}$ through addition/subtraction, multiplication, inverses, determinants, element-wise functions and so on, apply the corresponding differential rules to obtain $df$, then use the trace trick to wrap $df$ in a trace and move every other factor to the left of $d\boldsymbol{X}$; comparing with the relation $df = tr\left( {\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X}} \right)$ between derivative and differential then yields the derivative.
In particular, if the matrix degenerates to a vector, compare with $df = \left( \frac{\partial f}{\partial\boldsymbol{x}} \right)^{T}d\boldsymbol{x}$ to read off the derivative.
An example:
$y = \boldsymbol{a}^{T}\exp\left( \boldsymbol{X}\boldsymbol{b} \right)$; find $\frac{\partial y}{\partial\boldsymbol{X}}$.
By the first trace property: $dy = tr\left( {dy} \right) = tr\left( {d\left( {\boldsymbol{a}^{T}\exp\left( \boldsymbol{X}\boldsymbol{b} \right)} \right)} \right)$
By the second differential property: $tr\left( {d\left( {\boldsymbol{a}^{T}\exp\left( \boldsymbol{X}\boldsymbol{b} \right)} \right)} \right) = tr\left( {d\boldsymbol{a}^{T}\exp\left( {\boldsymbol{X}\boldsymbol{b}} \right) + \boldsymbol{a}^{T}d\exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)$. Since we are differentiating with respect to $\boldsymbol{X}$, $d\boldsymbol{a}^{T} = \boldsymbol{0}$, and therefore $tr\left( {d\left( {\boldsymbol{a}^{T}\exp\left( \boldsymbol{X}\boldsymbol{b} \right)} \right)} \right) = tr\left( {\boldsymbol{a}^{T}d\exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)$
By the sixth differential property (element-wise functions): $tr\left( {\boldsymbol{a}^{T}d\exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right) = tr\left( {\boldsymbol{a}^{T}\left( {\exp\left( \boldsymbol{X}\boldsymbol{b} \right)\odot d\left( \boldsymbol{X}\boldsymbol{b} \right)} \right)} \right)$
By the fifth trace property: $tr\left( {\boldsymbol{a}^{T}\left( {\exp\left( \boldsymbol{X}\boldsymbol{b} \right)\odot d\left( \boldsymbol{X}\boldsymbol{b} \right)} \right)} \right) = tr\left( {\left( {\boldsymbol{a}\odot \exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)^{T}\left( d\boldsymbol{X} \right)\boldsymbol{b}} \right)$, where we also used $d\left( \boldsymbol{X}\boldsymbol{b} \right) = \left( d\boldsymbol{X} \right)\boldsymbol{b}$ since $\boldsymbol{b}$ is constant.
By the third trace property (the cyclic property): $tr\left( {\left( {\boldsymbol{a}\odot \exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)^{T}\left( d\boldsymbol{X} \right)\boldsymbol{b}} \right) = tr\left( {\boldsymbol{b}\left( {\boldsymbol{a}\odot \exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)^{T}d\boldsymbol{X}} \right)$
Therefore $dy = tr\left( {\boldsymbol{b}\left( {\boldsymbol{a}\odot \exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)^{T}d\boldsymbol{X}} \right) = tr\left( {\left( {\left( {\boldsymbol{a}\odot \exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)\boldsymbol{b}^{T}} \right)^{T}d\boldsymbol{X}} \right)$; comparing with $df = tr\left( {\left( \frac{\partial f}{\partial\boldsymbol{X}} \right)^{T}d\boldsymbol{X}} \right)$, we obtain $\frac{\partial y}{\partial\boldsymbol{X}} = \left( {\boldsymbol{a}\odot \exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)\boldsymbol{b}^{T}$
A small example to verify this result: let $\boldsymbol{X} = \left\lbrack \begin{array}{ll} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \end{array} \right\rbrack$, $\boldsymbol{b} = \left\lbrack \begin{array}{l} 1 \\ 2 \\ \end{array} \right\rbrack$, $\boldsymbol{a} = \left\lbrack \begin{array}{l} 2 \\ 3 \\ \end{array} \right\rbrack$. Then $y = \left\lbrack 2,3 \right\rbrack\left\lbrack \begin{array}{l} {\exp\left( x_{11} + 2x_{12} \right)} \\ {\exp\left( x_{21} + 2x_{22} \right)} \\ \end{array} \right\rbrack = 2{\exp\left( {x_{11} + 2x_{12}} \right)} + 3\exp\left( x_{21} + 2x_{22} \right)$, and computing from the basic definition gives $\frac{\partial y}{\partial\boldsymbol{X}} = \left\lbrack \begin{array}{ll} {2{\exp\left( {x_{11} + 2x_{12}} \right)}} & {4{\exp\left( {x_{11} + 2x_{12}} \right)}} \\ {3\exp\left( x_{21} + 2x_{22} \right)} & {6\exp\left( x_{21} + 2x_{22} \right)} \\ \end{array} \right\rbrack$
Meanwhile $\left( {\boldsymbol{a}\odot \exp\left( {\boldsymbol{X}\boldsymbol{b}} \right)} \right)\boldsymbol{b}^{T} = \left\lbrack \begin{array}{l} {2\exp\left( x_{11} + 2x_{12} \right)} \\ {3\exp\left( x_{21} + 2x_{22} \right)} \\ \end{array} \right\rbrack\left\lbrack {1,~2} \right\rbrack = \left\lbrack \begin{array}{ll} {2{\exp\left( {x_{11} + 2x_{12}} \right)}} & {4{\exp\left( {x_{11} + 2x_{12}} \right)}} \\ {3\exp\left( x_{21} + 2x_{22} \right)} & {6\exp\left( x_{21} + 2x_{22} \right)} \\ \end{array} \right\rbrack$
The two agree, so the result checks out. A finite-difference check in code is sketched below.
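The same result can also be checked against finite differences directly in code. The sketch below uses the concrete $\boldsymbol{a}$ and $\boldsymbol{b}$ from the example and a random $\boldsymbol{X}$ (an arbitrary test point, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
a = np.array([2.0, 3.0])
b = np.array([1.0, 2.0])
X = rng.normal(size=(2, 2))

y = lambda X: a @ np.exp(X @ b)                      # y = a^T exp(X b)

# Analytic result derived above: (a ⊙ exp(Xb)) b^T
analytic = np.outer(a * np.exp(X @ b), b)

# Central-difference gradient over every entry of X.
eps, numeric = 1e-6, np.zeros_like(X)
for i in range(2):
    for j in range(2):
        E = np.zeros_like(X); E[i, j] = eps
        numeric[i, j] = (y(X + E) - y(X - E)) / (2 * eps)

print(np.allclose(numeric, analytic))                # expect True
```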
A short summary of the differential method:
With matrix differentials we can differentiate a whole vector or matrix at once instead of differentiating each element separately and stitching the results together, which is convenient; of course, fluent use requires being comfortable with the differential properties and trace properties above. There are also situations where the dependent and independent variables are related through a long chain of intermediate variables, and then the pure differential method becomes somewhat cumbersome. If we can combine a few common, simple derivative results with a chain rule, things become much more convenient, so the following sections discuss chain rules for vector and matrix derivatives.
1.7 Vector differentials and vector-by-vector derivatives
So far we have related the differential of a scalar to scalar-by-matrix/vector derivatives. We now extend this to the relation between the differential of a vector and the derivative of one vector (the output of some layer) with respect to another vector (the output of another layer):
$d\boldsymbol{f} = \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}}d\boldsymbol{x}$ (1.4)
Compared with formula 1.2, the derivative on the right-hand side no longer carries a transpose.
Let us first verify formula 1.4 with an example:
$\boldsymbol{f} = \boldsymbol{A}\boldsymbol{x}$, $\boldsymbol{A} = \left\lbrack \begin{array}{ll} 1 & 2 \\ 0 & {- 1} \\ \end{array} \right\rbrack$; find $\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}}$
Solution:
From the definition, first $\boldsymbol{f} = \left\lbrack \begin{array}{l} {x_{1} + 2x_{2}} \\ {- x_{2}} \\ \end{array} \right\rbrack$, and then by the vector-by-vector definition in Section 1.2, $\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll}1 & 2 \\0 & {- 1} \\\end{array} \right\rbrack$
Using formula 1.4 instead: $d\boldsymbol{f} = d\left( \boldsymbol{A}\boldsymbol{x} \right) = \boldsymbol{A}d\boldsymbol{x}$; comparing with $d\boldsymbol{f} = \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}}d\boldsymbol{x}$ gives $\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} = \boldsymbol{A}$, so the formula checks out.
To understand from first principles why formulas 1.2 and 1.4 differ by a transpose, compare the following two special cases:
1) $f = \boldsymbol{a}^{T}\boldsymbol{x}$, $\boldsymbol{a} = \left\lbrack \begin{array}{l}1 \\2 \\\end{array} \right\rbrack$; find $\frac{\partial f}{\partial\boldsymbol{x}}$
2) $\boldsymbol{f} = \boldsymbol{a}^{T}\boldsymbol{x}$, $\boldsymbol{a} = \left\lbrack \begin{array}{l} 1 \\ 2 \\ \end{array} \right\rbrack$; find $\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}}$
In the first case the left-hand side is a scalar $f = x_{1} + 2x_{2}$; in the second it is a length-1 "vector" $\boldsymbol{f} = \left\lbrack {x_{1} + 2x_{2}} \right\rbrack$.
Look at the first case:
From the definition we immediately get $\frac{\partial f}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{l} 1 \\ 2 \\ \end{array} \right\rbrack$. To obtain the scalar $df$ on the left of formula 1.2, one of the two column vectors $\frac{\partial f}{\partial\boldsymbol{x}}$ and $d\boldsymbol{x}$ must be transposed, otherwise the dimensions are not compatible; so we transpose $\frac{\partial f}{\partial\boldsymbol{x}}$ and obtain $df = d\left( {x_{1} + 2x_{2}} \right) = \left\lbrack \begin{array}{l}1 \\2 \\\end{array} \right\rbrack^{T}\left\lbrack \begin{array}{l}{dx_{1}} \\{dx_{2}} \\\end{array} \right\rbrack = \left( \frac{\partial f}{\partial\boldsymbol{x}} \right)^{T}d\boldsymbol{x}$
Now the second case:
From the definition we immediately get the 1×2 Jacobian $\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} = \left\lbrack {1,~2} \right\rbrack$. This is exactly where the two cases differ, and it is this difference that produces the different formulas: here $d\boldsymbol{f} = \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}}d\boldsymbol{x}$ already has compatible dimensions without any transpose.
What are formulas (1.2) and (1.3) good for? Something very useful: by writing down a total differential and applying a few simple matrix differential properties, we can obtain the derivative of a scalar (the network's loss) with respect to a matrix (a parameter matrix).
1.8 Chain rules for matrix and vector derivatives
We finally get to the chain rules.
Chain rules for matrix/vector derivatives can often give us the result quickly, but they are not identical to the scalar-by-scalar chain rule, so they deserve a separate discussion.
1.8.1 Chain rule for vector-by-vector derivatives
First, the chain rule for vector-by-vector derivatives. Suppose several vectors depend on one another, say three vectors $\left. \boldsymbol{x}\rightarrow\boldsymbol{y}\rightarrow\boldsymbol{z} \right.$; then the following chain rule holds:
$\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{x}} = \frac{\partial\boldsymbol{z}}{\partial\boldsymbol{y}}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}$
Note that this chain rule, in this form, only holds when all quantities involved are vectors.
An example to get a feel for it:
$\boldsymbol{z} = {\exp\left( \boldsymbol{y} \right)},~\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x},~\boldsymbol{A} = \left\lbrack \begin{array}{ll} 1 & 2 \\ 3 & 0 \\ \end{array} \right\rbrack$; find $\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{x}}$
Solution:
From the definition of $\boldsymbol{z}$ we get $\boldsymbol{z} = \left\lbrack \begin{array}{l} {\exp\left( x_{1} + 2x_{2} \right)} \\ {\exp\left( 3x_{1} \right)} \\ \end{array} \right\rbrack$; recalling the vector-by-vector definition in Section 1.2, $\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{x}} = \left\lbrack \begin{array}{ll}{\exp\left( x_{1} + 2x_{2} \right)} & {2\exp\left( x_{1} + 2x_{2} \right)} \\{3\exp\left( 3x_{1} \right)} & 0 \\\end{array} \right\rbrack$
If instead we use the chain rule, we first compute $\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{y}}$: $d\boldsymbol{z} = d\exp\left( \boldsymbol{y} \right) = {\exp\left( \boldsymbol{y} \right)}\odot d\boldsymbol{y}$. This is still one small step away from the form $d\boldsymbol{f} = \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}}d\boldsymbol{x}$; note that ${\exp\left( \boldsymbol{y} \right)}\odot d\boldsymbol{y} = diag\left( {\exp\left( \boldsymbol{y} \right)} \right)d\boldsymbol{y}$ (easy to verify).
Here $diag\left( {\exp\left( \boldsymbol{y} \right)} \right) = \left\lbrack \begin{array}{ll} {\exp\left( y_{1} \right)} & 0 \\ 0 & {\exp\left( y_{2} \right)} \\ \end{array} \right\rbrack$ places the elements of the vector $\exp\left( \boldsymbol{y} \right)$ on the diagonal of a matrix, with zeros everywhere else. So $d\boldsymbol{z} = d\exp\left( \boldsymbol{y} \right) = {\exp\left( \boldsymbol{y} \right)}\odot d\boldsymbol{y} = diag\left( {\exp\left( \boldsymbol{y} \right)} \right)d\boldsymbol{y}$, and $\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{y}} = diag\left( {\exp\left( \boldsymbol{y} \right)} \right)$.
Next we compute $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}$: $d\boldsymbol{y} = d\left( \boldsymbol{A}\boldsymbol{x} \right) = \left( d\boldsymbol{A} \right)\boldsymbol{x} + \boldsymbol{A}d\boldsymbol{x} = \boldsymbol{A}d\boldsymbol{x}$ since $\boldsymbol{A}$ is constant, so $\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = \boldsymbol{A}$.
$\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{x}} = \frac{\partial\boldsymbol{z}}{\partial\boldsymbol{y}}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} = diag\left( {\exp\left( \boldsymbol{y} \right)} \right)\boldsymbol{A} = \left\lbrack \begin{array}{ll}{exp\left( x_{1} + 2x_{2} \right)} & 0 \\0 & {exp\left( 3x_{1} \right)} \\\end{array} \right\rbrack\left\lbrack \begin{array}{ll}1 & 2 \\3 & 0 \\\end{array} \right\rbrack = \left\lbrack \begin{array}{ll}{exp\left( x_{1} + 2x_{2} \right)} & {2exp\left( x_{1} + 2x_{2} \right)} \\{3exp\left( 3x_{1} \right)} & 0 \\\end{array} \right\rbrack$
This matches the result obtained from the definition. A numerical check is sketched below.
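A corresponding numerical check of $\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{x}} = diag\left( \exp(\boldsymbol{y}) \right)\boldsymbol{A}$, again via central differences at an arbitrary test point (a minimal sketch, not part of the original text):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 0.0]])
x = np.array([0.4, -0.9])
y = A @ x

# Chain rule result: dz/dx = diag(exp(y)) @ A
analytic = np.diag(np.exp(y)) @ A

# Numerical Jacobian of z = exp(A x)
z = lambda x: np.exp(A @ x)
eps, J = 1e-6, np.zeros((2, 2))
for j in range(2):
    e = np.zeros(2); e[j] = eps
    J[:, j] = (z(x + e) - z(x - e)) / (2 * eps)

print(np.allclose(J, analytic))      # expect True
```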
1.8.2 Chain rule for a scalar through multiple vectors
The chain rule for a scalar depending on a chain of vectors can be derived from two useful results obtained above:
- Result 1: when $f$ is a scalar, if we let $\boldsymbol{f} = \left\lbrack f \right\rbrack$ be the corresponding special 1x1 vector, then $\frac{\partial f}{\partial\boldsymbol{x}} = \left( \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} \right)^{T}$
- Result 2: if $\boldsymbol{x},\boldsymbol{y},\boldsymbol{z}$ are vectors, then $\frac{\partial\boldsymbol{z}}{\partial\boldsymbol{x}} = \frac{\partial\boldsymbol{z}}{\partial\boldsymbol{y}}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}$
If $\left. \boldsymbol{x}\rightarrow\boldsymbol{y}\rightarrow f \right.$ (a scalar), then by Result 1, $\frac{\partial f}{\partial\boldsymbol{x}} = \left( \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} \right)^{T}$.
The left-hand side is a scalar-by-vector derivative; the right-hand side is a vector-by-vector derivative, so we can now apply Result 2, the vector-by-vector chain rule, to it: $\frac{\partial f}{\partial\boldsymbol{x}} = \left( \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} \right)^{T} = \left( {\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{y}}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}} \right)^{T}$.
We are not quite done: what we ultimately want is a scalar-by-vector derivative, since that is the form the trace trick works with, so we convert the special vector $\boldsymbol{f}$ back to the scalar $f$ (using Result 1 again) and obtain the chain rule for a scalar through multiple vectors:
$\frac{\partial f}{\partial\boldsymbol{x}} = \left( \frac{\partial\boldsymbol{f}}{\partial\boldsymbol{x}} \right)^{T} = \left( {\frac{\partial\boldsymbol{f}}{\partial\boldsymbol{y}}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}} \right)^{T} = \left( {\left( \frac{\partial f}{\partial\boldsymbol{y}} \right)^{T}\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}} \right)^{T} = \left( \frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}} \right)^{T}\frac{\partial f}{\partial\boldsymbol{y}}$
Generalizing, if $\left.\boldsymbol{y}_{1}\rightarrow\boldsymbol{y}_{2}\rightarrow\boldsymbol{y}_{3}\rightarrow\ldots\rightarrow\boldsymbol{y}_{\boldsymbol{n}}\rightarrow z \right.$ (a scalar), then:
$\frac{\partial z}{\partial\boldsymbol{y}_{1}} = \left( {\frac{\partial\boldsymbol{y}_{\boldsymbol{n}}}{\partial\boldsymbol{y}_{\boldsymbol{n} - 1}}\frac{\partial\boldsymbol{y}_{\boldsymbol{n} - 1}}{\partial\boldsymbol{y}_{\boldsymbol{n} - 2}}\ldots\frac{\partial\boldsymbol{y}_{2}}{\partial\boldsymbol{y}_{1}}} \right)^{T}\frac{\partial z}{\partial\boldsymbol{y}_{\boldsymbol{n}}}$
1.8.3 Chain rule for a scalar through matrices (proof omitted)
Now consider a scalar that depends on matrices through a chain such as $\left. \boldsymbol{X}\rightarrow\boldsymbol{Y}\rightarrow z \right.$. Then:
$\frac{\partial z}{\partial x_{ij}} = {\sum\limits_{k,l}{\frac{\partial z}{\partial Y_{kl}}\frac{\partial Y_{kl}}{\partial X_{ij}} = tr\left( {\left( \frac{\partial z}{\partial Y} \right)^{T}\frac{\partial Y}{\partial X_{ij}}} \right)}}$
Notice that we have not given a chain rule stated at the level of whole matrices. The main reason is that matrix-by-matrix derivatives have a rather involved definition, which this series does not need, so we can only give the chain rule for one scalar element of the matrix. This element-wise form is not very practical, because we do not want to fall back to the definition every time and then re-assemble the results.
For practical purposes there is no need to dig deeper; it is enough to remember a few common cases, and they are very easy to remember.
The following results can be called "derivatives of a scalar through a linear map".
In summary:
- If $z = f\left( \boldsymbol{Y} \right),~\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{B}$, then $\frac{\partial z}{\partial\boldsymbol{X}} = \boldsymbol{A}^{T}\frac{\partial z}{\partial\boldsymbol{Y}}$
- The result still holds when $\boldsymbol{X}$ is replaced by a vector $\boldsymbol{x}$: if $z = f\left( \boldsymbol{y} \right),~\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}$, then $\frac{\partial z}{\partial\boldsymbol{x}} = \boldsymbol{A}^{T}\frac{\partial z}{\partial\boldsymbol{y}}$
- If instead we differentiate with respect to the coefficient matrix: $z = f\left( \boldsymbol{y} \right),~\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}$, then $\frac{\partial z}{\partial\boldsymbol{A}} = \frac{\partial z}{\partial\boldsymbol{y}}\boldsymbol{x}^{T}$
1.9 Using matrix calculus to compute parameter gradients in machine learning
Differentiating a neural network's loss with respect to its parameters is an important result in the history of the field and even has its own name, the backpropagation (BP) algorithm. I suspect many people find their first derivation of BP rather painful; in fact, with matrix calculus the derivation is not complicated. For simplicity we derive BP for a two-layer neural network here; later chapters will systematically derive the parameter gradients of FNN, CNN, RNN and LSTM.
Using everything learned above, let us analyze the gradient of a two-layer network's loss with respect to each layer's parameters. Take the classic MNIST handwritten digit classification problem as an example: the network takes the image flattened into a vector $\boldsymbol{x}$ as input and outputs a probability vector over the classes, while $\boldsymbol{y}$ denotes the (one-hot) label vector. With the cross-entropy loss we get the following expression:
$l = - \boldsymbol{y}^{T}{\log{softmax\left( {\boldsymbol{W}_{2}\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) + \boldsymbol{b}_{2}} \right)}}$
Here $\boldsymbol{x}$ is an n x 1 column vector, $\boldsymbol{W}_{1}$ is a p x n matrix, $\boldsymbol{W}_{2}$ is an m x p matrix, $\boldsymbol{y}$ is an m x 1 column vector, $l$ is a scalar, and $\sigma$ is the sigmoid function.
We compute the gradients layer by layer, from the output backwards.
Note that $softmax\left( \boldsymbol{x} \right) = \frac{\exp\left( \boldsymbol{x} \right)}{ \boldsymbol{1}^{T}\exp\left( \boldsymbol{x} \right)}$, where $\exp\left( \boldsymbol{x} \right)$ is a column vector, $\boldsymbol{1}^{T}$ is a row vector of all ones, and $\boldsymbol{1}^{T}\exp\left( \boldsymbol{x} \right)$ is a scalar. As a small example, if $\boldsymbol{x} = \left\lbrack \begin{array}{l} 1 \\ 2 \\ 3 \\ \end{array} \right\rbrack$, then $softmax\left( \boldsymbol{x} \right) = \left\lbrack \begin{array}{l}\frac{\exp\left( 1 \right)}{{\exp\left(1 \right)} + {\exp\left( 2 \right)} + {\exp\left( 3 \right)}} \\\frac{\exp\left( 2 \right)}{{\exp\left(1 \right)} + {\exp\left( 2 \right)} + {\exp\left( 3 \right)}} \\\frac{\exp\left( 3 \right)}{{\exp\left(1 \right)} + {\exp\left( 2 \right)} + {\exp\left( 3 \right)}} \\\end{array} \right\rbrack$
Let $\boldsymbol{a} = \boldsymbol{W}_{2}\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) + \boldsymbol{b}_{2}$; we first compute $\frac{\partial l}{\partial\boldsymbol{a}}$.
$dl = - \boldsymbol{y}^{T}d\left( {logsoftmax\left( \boldsymbol{a} \right)} \right) = - \boldsymbol{y}^{T}d\left( {log\left( \frac{exp\left( \boldsymbol{a} \right)}{ \boldsymbol{1}^{T}exp\left( \boldsymbol{a} \right)} \right)} \right)$
Note that the element-wise log satisfies $log\left( {\boldsymbol{u}/c} \right) = log\left( \boldsymbol{u} \right) - \boldsymbol{1}log\left( c \right)$, where $\boldsymbol{u}$ and $\boldsymbol{1}$ are column vectors of the same shape and $c$ is a scalar.
Since $\boldsymbol{1}^{T}\exp\left( \boldsymbol{a} \right)$ is a scalar, applying this rule gives:
$log\left( \frac{exp\left( \boldsymbol{a} \right)}{ \boldsymbol{1}^{T}exp\left( \boldsymbol{a} \right)} \right) = log\left( {exp\left( \boldsymbol{a} \right)} \right) - \boldsymbol{1}log\left( { \boldsymbol{1}^{T}exp\left( \boldsymbol{a} \right)} \right)$
Hence:
$dl = - \boldsymbol{y}^{T}d\left( {log\left( {\exp\left( \boldsymbol{a} \right)} \right) - \boldsymbol{1}log\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right) = - \boldsymbol{y}^{T}d\left( {\boldsymbol{a} - \boldsymbol{1}log\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right) = - \boldsymbol{y}^{T}d\boldsymbol{a} + d\left( {\boldsymbol{y}^{T}\boldsymbol{1}log\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right)$
Since the elements of $\boldsymbol{y}$ sum to 1, we have $\boldsymbol{y}^{T}\boldsymbol{1} = 1$. This further gives:
$dl = - \boldsymbol{y}^{T}d\boldsymbol{a} + d\left( {log\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right)$
By the sixth matrix differential property (element-wise functions):
$d\left( {log\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right) = {log}^{'}\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right) \odot d\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)$
Since both ${log}^{'}\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)$ and $d\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)$ are scalars, we have:
${log}^{'}\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right) \odot d\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right) = \frac{d\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)}{\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}}$
Continuing with the second and sixth matrix differential properties ($\boldsymbol{1}$ is constant): $\frac{d\left( {\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} \right)}{\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} = \frac{\boldsymbol{1}^{T}d\left( {\exp\left( \boldsymbol{a} \right)} \right)}{\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} = \frac{\boldsymbol{1}^{T}\left( {{\exp\left( \boldsymbol{a} \right)} \odot d\boldsymbol{a}} \right)}{\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}}$
Next, applying the first trace property to the numerator:
$d\left( {1^{T}{\exp\left( \boldsymbol{a} \right)}} \right) = tr\left( {d\left( {1^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right) = tr\left( {1^{T}\left( {{\exp\left( \boldsymbol{a} \right)} \odot d\boldsymbol{a}} \right)} \right)$
By the fifth trace property:
$tr\left( {1^{T}\left( {{\exp\left( \boldsymbol{a} \right)} \odot d\boldsymbol{a}} \right)} \right) = tr\left( \left( {\left( {1{{\odot \exp}\left( \boldsymbol{a} \right)}} \right)^{T}d\boldsymbol{a}} \right) \right) = tr\left( {\left( {\exp\left( \boldsymbol{a} \right)} \right)^{T}d\boldsymbol{a}} \right)$
Applying the first trace property again in reverse (the trace of a scalar is the scalar itself):
$tr\left( {\left( {\exp\left( \boldsymbol{a} \right)} \right)^{T}d\boldsymbol{a}} \right) = \left( {\exp\left( \boldsymbol{a} \right)} \right)^{T}d\boldsymbol{a}$
Therefore:
$dl = - \boldsymbol{y}^{T}d\boldsymbol{a} + d\left( {log\left( {1^{T}{\exp\left( \boldsymbol{a} \right)}} \right)} \right) = - \boldsymbol{y}^{T}d\boldsymbol{a} + \frac{\left( {\exp\left( \boldsymbol{a} \right)} \right)^{T}d\boldsymbol{a}}{1^{T}{\exp\left( \boldsymbol{a} \right)}} = \left( {- \boldsymbol{y}^{T} + \frac{\left( {\exp\left( \boldsymbol{a} \right)} \right)^{T}}{1^{T}{\exp\left( \boldsymbol{a} \right)}}} \right)d\boldsymbol{a}$
Comparing with formula 1.2, we obtain $\frac{\partial l}{\partial\boldsymbol{a}} = \frac{\exp\left( \boldsymbol{a} \right)}{\boldsymbol{1}^{T}{\exp\left( \boldsymbol{a} \right)}} - \boldsymbol{y} = softmax\left( \boldsymbol{a} \right) - \boldsymbol{y}$. A numerical check of this well-known result is sketched below.
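The result $\frac{\partial l}{\partial\boldsymbol{a}} = softmax(\boldsymbol{a}) - \boldsymbol{y}$ is easy to confirm numerically. A minimal sketch (the `softmax` helper, the random logits and the one-hot label are illustrative assumptions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())          # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(4)
a = rng.normal(size=5)
y = np.zeros(5); y[2] = 1.0          # a one-hot label, so y^T 1 = 1 holds

loss = lambda a: -y @ np.log(softmax(a))

# Central-difference gradient vs. softmax(a) - y
eps, g = 1e-6, np.zeros_like(a)
for i in range(a.size):
    e = np.zeros_like(a); e[i] = eps
    g[i] = (loss(a + e) - loss(a - e)) / (2 * eps)

print(np.allclose(g, softmax(a) - y))    # expect True
```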
Next we compute $\frac{\partial l}{\partial\boldsymbol{W}_{2}}$:
We know that $l = - \boldsymbol{y}^{T}{\log{softmax\left( \boldsymbol{a} \right)}}$ with $\boldsymbol{a} = \boldsymbol{W}_{2}\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) + \boldsymbol{b}_{2}$.
Recall the last result of Section 1.8.3: this has exactly the same form as $l = f\left( \boldsymbol{a} \right),\boldsymbol{a} = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}$ with the derivative taken with respect to the coefficient matrix $\boldsymbol{A}$.
Applying that result directly, we get $\frac{\partial l}{\partial\boldsymbol{W}_{2}} = \frac{\partial l}{\partial\boldsymbol{a}}\left( {\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right)^{T}$
Since $d\boldsymbol{a} = d\boldsymbol{b}_{2} = \boldsymbol{I}d\boldsymbol{b}_{2}$ when everything except $\boldsymbol{b}_{2}$ is held fixed,
we immediately get $\frac{\partial l}{\partial\boldsymbol{b}_{2}} = \boldsymbol{I}^{T}\frac{\partial l}{\partial\boldsymbol{a}} = \frac{\partial l}{\partial\boldsymbol{a}}$, which gives the gradients of $\boldsymbol{W}_{2}$ and $\boldsymbol{b}_{2}$ in the second layer.
Next we want $\frac{\partial l}{\partial\boldsymbol{W}_{1}}$:
Let $\boldsymbol{z} = \boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}$; we first compute $\frac{\partial l}{\partial\boldsymbol{z}}$.
By the sixth matrix differential property:
$d\boldsymbol{a} = d\left( {\boldsymbol{W}_{2}\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) + \boldsymbol{b}_{2}} \right) = \boldsymbol{W}_{2}d\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) = \boldsymbol{W}_{2}\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) \odot d\boldsymbol{z}} \right)$
$\boldsymbol{W}_{2}\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right) \odot d\boldsymbol{z}} \right) = \boldsymbol{W}_{2}diag\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right)d\boldsymbol{z}$
So $\frac{\partial\boldsymbol{a}}{\partial\boldsymbol{z}} = \boldsymbol{W}_{2}diag\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right)$, and hence, by the chain rule of Section 1.8.2, $\frac{\partial l}{\partial\boldsymbol{z}} = diag\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right){\boldsymbol{W}_{2}}^{T}\frac{\partial l}{\partial\boldsymbol{a}}$
Now that $\frac{\partial l}{\partial\boldsymbol{z}}$ is known and $\boldsymbol{z} = \boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}$, we want $\frac{\partial l}{\partial\boldsymbol{W}_{1}}$.
Using the last result of Section 1.8.3 again, we get $\frac{\partial l}{\partial\boldsymbol{W}_{1}} = \frac{\partial l}{\partial\boldsymbol{z}}\boldsymbol{x}^{T}$,
and by the same argument as for $\boldsymbol{b}_{2}$, $\frac{\partial l}{\partial\boldsymbol{b}_{1}} = \left( \frac{\partial\boldsymbol{z}}{\partial\boldsymbol{b}_{1}} \right)^{T}\frac{\partial l}{\partial\boldsymbol{z}} = \frac{\partial l}{\partial\boldsymbol{z}}$
We have now obtained the gradients of all the parameters of the two-layer network, $\frac{\partial l}{\partial\boldsymbol{W}_{2}},\frac{\partial l}{\partial\boldsymbol{b}_{2}},~\frac{\partial l}{\partial\boldsymbol{W}_{1}},~\frac{\partial l}{\partial\boldsymbol{b}_{1}}$:
$\frac{\partial l}{\partial\boldsymbol{W}_{2}} = \frac{\partial l}{\partial\boldsymbol{a}}\left( {\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right)^{T}$
$\frac{\partial l}{\partial\boldsymbol{b}_{2}} = \frac{\partial l}{\partial\boldsymbol{a}}$
$\frac{\partial l}{\partial\boldsymbol{W}_{1}} = diag\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right){\boldsymbol{W}_{2}}^{T}\frac{\partial l}{\partial\boldsymbol{a}}\boldsymbol{x}^{T}$
$\frac{\partial l}{\partial\boldsymbol{b}_{1}} = diag\left( {\sigma^{'}\left( {\boldsymbol{W}_{1}\boldsymbol{x} + \boldsymbol{b}_{1}} \right)} \right){\boldsymbol{W}_{2}}^{T}\frac{\partial l}{\partial\boldsymbol{a}}$
where $\frac{\partial l}{\partial\boldsymbol{a}} = softmax\left( \boldsymbol{a} \right) - \boldsymbol{y}$.
As we can see, to obtain the gradients of all the parameters we really only need the gradient at the output, $\frac{\partial l}{\partial\boldsymbol{a}}$; the gradients of the earlier, hidden-layer parameters are just matrix operations applied to that output-layer gradient. The sketch below checks all four gradients numerically against finite differences.
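To close the loop in the spirit of this series ("derive, then verify in code"), here is a NumPy sketch that checks the four analytic gradients above against central-difference estimates on a tiny randomly initialized network (the layer sizes, random seed and the `sigmoid`/`softmax` helpers are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, m = 4, 3, 2                                     # toy layer sizes
x = rng.normal(size=n)
y = np.zeros(m); y[1] = 1.0                           # one-hot label
W1, b1 = rng.normal(size=(p, n)), rng.normal(size=p)
W2, b2 = rng.normal(size=(m, p)), rng.normal(size=m)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
softmax = lambda a: np.exp(a) / np.exp(a).sum()

def loss(W1, b1, W2, b2):
    a = W2 @ sigmoid(W1 @ x + b1) + b2
    return -y @ np.log(softmax(a))

# Analytic gradients from the derivation above
z = W1 @ x + b1
h = sigmoid(z)                                        # sigma(W1 x + b1)
a = W2 @ h + b2
dl_da = softmax(a) - y
dl_dW2 = np.outer(dl_da, h)
dl_db2 = dl_da
dl_dz = np.diag(h * (1 - h)) @ W2.T @ dl_da           # sigma'(z) = sigma(z) * (1 - sigma(z))
dl_dW1 = np.outer(dl_dz, x)
dl_db1 = dl_dz

def numerical_grad(param, loss_of_param, eps=1e-6):
    """Central-difference gradient of the loss w.r.t. one parameter array."""
    g = np.zeros_like(param)
    for i in range(param.size):
        plus, minus = param.copy(), param.copy()
        plus.flat[i] += eps
        minus.flat[i] -= eps
        g.flat[i] = (loss_of_param(plus) - loss_of_param(minus)) / (2 * eps)
    return g

print(np.allclose(numerical_grad(W2, lambda W: loss(W1, b1, W, b2)), dl_dW2))
print(np.allclose(numerical_grad(b2, lambda b: loss(W1, b1, W2, b)), dl_db2))
print(np.allclose(numerical_grad(W1, lambda W: loss(W, b1, W2, b2)), dl_dW1))
print(np.allclose(numerical_grad(b1, lambda b: loss(W1, b, W2, b2)), dl_db1))
# expect four lines of True
```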
---------Generalization---------
The derivation above is for a single sample. In practice we have n samples $\left( {\boldsymbol{x}_{1},\boldsymbol{y}_{1}} \right),\left( {\boldsymbol{x}_{2},\boldsymbol{y}_{2}} \right),\ldots,\left( {\boldsymbol{x}_{n},\boldsymbol{y}_{n}} \right)$, so the loss function becomes
$l = {\sum\limits_{i = 1}^{n}{- {\boldsymbol{y}_{\boldsymbol{i}}}^{T}{\log{softmax\left( {\boldsymbol{W}_{2}\sigma\left( {\boldsymbol{W}_{1}\boldsymbol{x}_{\boldsymbol{i}} + \boldsymbol{b}_{1}} \right) + \boldsymbol{b}_{2}} \right)}}}}$
This loss is still a scalar, however, so the reasoning is unchanged: the summation can be pulled to the outside, and the gradient of each parameter is simply the sum of the gradients computed for each sample separately.
References:
- https://www.cnblogs.com/pinard/category/894690.html
- https://zhuanlan.zhihu.com/p/24709748
- https://github.com/soloice/Matrix_Derivatives
(Feel free to repost with attribution. Comments and discussion are welcome: [email protected])