The Softmax Function

1. Imports
2. The softmax function
- 2.1 Overview of the algorithm
- 2.2 Loss function
3. TensorFlow
4. Numerical stability of softmax
5. Quiz questions

1. Imports

import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from IPython.display import display, Markdown, Latex
from sklearn.datasets import make_blobs
%matplotlib widget
from matplotlib.widgets import Slider
from lab_utils_common import dlc
from lab_utils_softmax import plt_softmax
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)

2. The softmax function

2.1 Overview of the algorithm

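In both softmax regression and neural networks with a softmax output, N outputs are generated and one of them is selected as the predicted category. In both cases a vector $\mathbf{z}$ is produced by a linear function and then passed to the softmax function. The softmax function converts $\mathbf{z}$ into a probability distribution, as described below. After softmax is applied, each output lies between 0 and 1 and the outputs sum to 1, so they can be interpreted as probabilities. Larger inputs correspond to larger output probabilities. Because softmax uses exponentials, it stretches the gaps between inputs: values that are already far apart end up even further apart.
The softmax function can be written:
$$a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}} \tag{1}$$
where $z_j$ is the output value of the j-th node and N is the number of output nodes, i.e. the number of categories. The output $\mathbf{a}$ is a vector of length N, so for softmax regression you could also write: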

$$\mathbf{a}(x)=\begin{bmatrix}P(y=1|\mathbf{x};\mathbf{w},b)\\ \vdots\\ P(y=N|\mathbf{x};\mathbf{w},b)\end{bmatrix}=\frac{1}{\sum_{k=1}^{N} e^{z_k}}\begin{bmatrix}e^{z_1}\\ \vdots\\ e^{z_N}\end{bmatrix} \tag{2}$$

The output is a vector holding the probability of each possible value of y. A NumPy implementation:

def my_softmax(z):
    ez = np.exp(z)              # element-wise exponential
    sm = ez/np.sum(ez)          # normalize so the outputs sum to 1
    return sm

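A few points to note:

- The exponential in the numerator of softmax magnifies small differences between the values.
- The output values sum to 1.
- Softmax spans all of the outputs. For example, changing "z0" changes the values of "a0" through "a3". Compare this with other activations such as ReLU or sigmoid, which have a single input and a single output.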
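The points above can be checked with a few lines of NumPy. This is only a minimal sketch: the input values are made up for illustration, and `softmax` here is a local copy of the formula in equation (1), not a library function.

```python
import numpy as np

def softmax(z):
    """Plain softmax of a 1-D array, following equation (1)."""
    ez = np.exp(z)
    return ez / np.sum(ez)

# Illustrative inputs (assumed values, not taken from the lab).
z = np.array([1.0, 2.0, 3.0, 4.0])
a = softmax(z)
print(a, a.sum())            # four probabilities and their sum

# Increase only z0 and recompute: every output changes, not just a0.
z_changed = z.copy()
z_changed[0] = 3.0
a_changed = softmax(z_changed)
print(a_changed - a)         # a0 rises while a1..a3 all fall
```

Because the denominator is shared by all outputs, raising one input necessarily lowers every other output.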

2.2 Loss function

When softmax is used as the activation of the output nodes, the cross-entropy loss is generally used as the loss function.

[Figure: comparison of logistic regression and softmax]
The cross-entropy loss:
$$L(\mathbf{a},y)=\begin{cases} -\log(a_1), & \text{if } y=1\\ \quad\vdots\\ -\log(a_N), & \text{if } y=N \end{cases} \tag{3}$$
where y is the target category for this example and $\mathbf{a}$ is the output of the softmax function. In particular, the values in $\mathbf{a}$ are probabilities that sum to 1.
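As a small illustration of loss (3) for a single example, the sketch below picks out the probability assigned to the target and takes its negative log. The function name `cross_entropy_loss` and the probability vector are assumptions for this example; note also that the code uses 0-based class indices, while the equation numbers classes from 1.

```python
import numpy as np

def cross_entropy_loss(a, y):
    """Loss (3) for one example: a is a softmax output, y the 0-based target index."""
    return -np.log(a[y])

# Assumed softmax output for one example with three classes.
a = np.array([0.05, 0.90, 0.05])

loss_confident_right = cross_entropy_loss(a, 1)   # target received probability 0.90
loss_confident_wrong = cross_entropy_loss(a, 0)   # target received probability 0.05
print(loss_confident_right, loss_confident_wrong)
```

The loss is small when the target class receives most of the probability and grows quickly as that probability approaches 0.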

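Note: in this course, the loss is for a single example, while the cost covers all examples.

Note that in (3) above, only the row that corresponds to the target contributes to the loss; the other rows contribute zero. To write the cost equation we need an "indicator function" that is 1 when the index matches the target and 0 otherwise.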

$$\mathbf{1}\{y == n\} = \begin{cases} 1, & \text{if } y==n\\ 0, & \text{otherwise} \end{cases}$$
Now the cost is:
$$J(\mathbf{w},b) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{N} \mathbf{1}\left\{y^{(i)} == j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^{N} e^{z^{(i)}_k}}\right] \tag{4}$$
where m is the number of examples and N is the number of outputs. This is the average of all the losses.
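The cost in (4) can be sketched in NumPy by building the indicator as a boolean mask. This is an illustrative sketch under assumptions: the helper names `softmax_rows` and `softmax_cost` and the sample values are invented here, targets are 0-based integers, and the result is divided by m because the text describes the cost as the average of the losses.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax; each row of Z holds the N values z_k of one example."""
    ez = np.exp(Z)
    return ez / ez.sum(axis=1, keepdims=True)

def softmax_cost(Z, y):
    """Average cross-entropy over m examples with 0-based integer targets y."""
    m, N = Z.shape
    A = softmax_rows(Z)
    indicator = (y[:, None] == np.arange(N))    # 1{y_i == j} as an (m, N) mask
    return -np.sum(indicator * np.log(A)) / m

# Assumed values: two examples, three classes.
Z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])
y = np.array([0, 1])

cost = softmax_cost(Z, y)
per_example = -np.log(softmax_rows(Z)[np.arange(len(y)), y])   # loss (3) per example
print(cost, per_example.mean())                                # the two values agree
```

The mask keeps exactly one log-probability per row, which is why the double sum reduces to the mean of the per-example losses.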

3. TensorFlow

Creating the data

# make a dataset for the example
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0,random_state=30)

3.1 The Obvious organization

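The model below is implemented with softmax as the activation in the final Dense layer.

The loss function is specified separately in the compile directive.

The loss function is SparseCategoricalCrossentropy, the loss described in (3) above. In this model the softmax takes place in the last layer, and the loss function takes the softmax output, a vector of probabilities, as its input.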

model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'softmax')    # < softmax activation here
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)

model.fit(
    X_train,y_train,
    epochs=10
)

Because softmax is integrated into the output layer, the output is a vector of probabilities.

Prediction:

p_nonpreferred = model.predict(X_train)
print(p_nonpreferred[:2])
print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))

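3.2 Preferred

3.2.1 Overview of the algorithm

More stable and accurate results can be obtained if the softmax and the loss are combined during training. This is enabled by the "preferred" organization used in the model below.

In the preferred organization, the final layer has a linear activation (equivalent to no activation function). For historical reasons, the outputs in this form are called logits. The loss function has an additional argument, from_logits=True. This tells the loss function that the softmax operation should be included in the loss calculation, which allows for an optimized implementation.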

preferred_model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'linear')   #<-- Note
    ]
)
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  #<-- Note
    optimizer=tf.keras.optimizers.Adam(0.001),  # Adam, a gradient-descent-based optimization algorithm
)

preferred_model.fit(
    X_train,y_train,
    epochs=10
)

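3.2.2 Handling the output

Note that in the preferred model the outputs are not probabilities; they can range from large negative numbers to large positive numbers. When a prediction that expects probabilities is made, the output must be passed through a softmax.

Let's look at the preferred model's outputs: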

p_preferred = preferred_model.predict(X_train)
print(f"two example output vectors:\n {p_preferred[:2]}")
print("largest value", np.max(p_preferred), "smallest value", np.min(p_preferred))

two example output vectors:
 [[-2.94 -2.33  2.86 -1.25]
 [ 1.5  -4.28 -7.08 -7.93]]
largest value 8.857447 smallest value -13.404879

If the desired output is probabilities, the output should be processed by a softmax.

sm_preferred = tf.nn.softmax(p_preferred).numpy()
print(f"two example output vectors:\n {sm_preferred[:2]}")
print("largest value", np.max(sm_preferred), "smallest value", np.min(sm_preferred))

two example output vectors:
 [[2.97e-03 5.46e-03 9.75e-01 1.62e-02]
 [9.97e-01 3.08e-03 1.86e-04 8.00e-05]]
largest value 0.99999774 smallest value 1.0387312e-07
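The same conversion can be reproduced without TensorFlow. The sketch below applies a row-wise softmax in NumPy to the two logit vectors printed earlier; the helper name `softmax_rows` is an assumption, and because the logits are copied by hand from output rounded to two decimals, the probabilities match the printed ones only approximately.

```python
import numpy as np

# The two logit vectors printed by preferred_model.predict, rounded to two decimals.
logits = np.array([[-2.94, -2.33,  2.86, -1.25],
                   [ 1.5,  -4.28, -7.08, -7.93]])

def softmax_rows(z):
    """Softmax applied independently to each row of a 2-D array."""
    ez = np.exp(z)
    return ez / ez.sum(axis=1, keepdims=True)

probs = softmax_rows(logits)
print(probs)
print(probs.sum(axis=1))    # each row sums to 1
```

The `axis=1, keepdims=True` arguments matter here: summing over the whole array, as a single-vector softmax does, would normalize the two examples together instead of separately.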

To select the most likely category, the softmax is not required. The index of the largest output can be found with np.argmax().

for i in range(5):
    print( f"{p_preferred[i]}, category: {np.argmax(p_preferred[i])}")

[-2.94 -2.33  2.86 -1.25], category: 2
[ 1.5  -4.28 -7.08 -7.93], category: 0
[ 1.02 -2.93 -5.43 -6.26], category: 0
[-2.19  3.48 -1.81 -2.91], category: 1
[-2.32 -6.31  3.67 -4.91], category: 2

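The argmax function:
- y = f(t) is an ordinary function: given a value of t, f(t) assigns a value to y.
- y = max f(t) means that y is the largest output among all the values of f(t).
- y = argmax f(t) means that y is the argument t that produces the largest output of f(t). For example:
Suppose a function f(t) where t can take the values {0,1,2}, with f(t=0) = 10, f(t=1) = 20 and f(t=2) = 7. The corresponding values of y are:
- y = max f(t) = 20
- y = argmax f(t) = 1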
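The example above translates directly to NumPy. The second half of this sketch also illustrates why the softmax can be skipped when only the category is needed: softmax preserves the ordering of its inputs, so the argmax of the logits and of the probabilities coincide. The logit values are taken from the printed output earlier; the variable names are assumptions.

```python
import numpy as np

f = np.array([10, 20, 7])      # f(t) for t = 0, 1, 2
print(np.max(f))               # 20, the largest output
print(np.argmax(f))            # 1, the t that produces it

# argmax is unchanged by softmax, since exp is monotonically increasing.
logits = np.array([-2.94, -2.33, 2.86, -1.25])
probs = np.exp(logits) / np.sum(np.exp(logits))
print(np.argmax(logits), np.argmax(probs))
```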

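3.3 SparseCategoricalCrossentropy or CategoricalCrossentropy

TensorFlow has two possible formats for target values, and the choice of loss determines which one is expected.

- SparseCategoricalCrossentropy: expects the target to be an integer corresponding to the index. For example, if there are 10 possible target values, y is between 0 and 9.
- CategoricalCrossentropy: expects the target value of an example to be one-hot encoded, where the value at the target index is 1 and the other N-1 entries are 0. An example with 10 possible target values whose target is 2 would be [0,0,1,0,0,0,0,0,0,0].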
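The two target formats carry the same information, and converting between them takes one line each way. The following is a minimal NumPy sketch; the sample targets are made up, and indexing an identity matrix is just one common way to build one-hot rows.

```python
import numpy as np

num_classes = 10
y_sparse = np.array([2, 0, 9])               # integer targets, as SparseCategoricalCrossentropy expects

# Row k of the identity matrix is the one-hot vector for class k.
y_onehot = np.eye(num_classes)[y_sparse]     # shape (3, 10), as CategoricalCrossentropy expects
print(y_onehot[0])

# Going back: the position of the 1 in each row is the integer target.
y_back = np.argmax(y_onehot, axis=1)
print(y_back)
```

The sparse format is more compact, which is one reason it is convenient when the number of categories is large.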

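4. Numerical stability of softmax

4.1 The problem

When softmax is used as the activation of the output nodes, cross-entropy is generally used as the loss function. The numerical computation of softmax can easily overflow when the values at the output nodes are large, and overflow can also occur when the cross-entropy is computed. For numerical stability, TensorFlow provides a unified interface that implements softmax and the cross-entropy loss together and handles the numerically unstable cases. When using TensorFlow, it is generally recommended to use this unified interface rather than applying softmax and the cross-entropy loss separately.

The input to softmax is the output of a linear layer, $z_j = \mathbf{w_j} \cdot \mathbf{x}^{(i)}+b$. These values can be large, and the first step of the softmax algorithm computes $e^{z_j}$, which can cause an overflow error if the number is too large.
For example: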

for z in [500,600,700,800]:
    ez = np.exp(z)
    zs = "{" + f"{z}" + "}"
    print(f"e^{zs} = {ez:0.2e}")
    
e^{500} = 1.40e+217
e^{600} = 3.77e+260
e^{700} = 1.01e+304
e^{800} = inf

Calling the my_softmax function written earlier overflows in the same way:

z_tmp = np.array([[500,600,700,800]])
my_softmax(z_tmp)
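To see concretely what the overflow does to the result, the sketch below repeats the call in a self-contained form. The `np.errstate` context is added here only to silence the overflow warnings for the demonstration; it does not change the values.

```python
import numpy as np

def my_softmax(z):
    ez = np.exp(z)              # e^800 overflows to inf in float64
    return ez / np.sum(ez)      # the sum is inf as well

z_tmp = np.array([[500., 600., 700., 800.]])
with np.errstate(over="ignore", invalid="ignore"):
    result = my_softmax(z_tmp)
print(result)
```

The first three entries come out as 0 (a finite number divided by inf) and the last one as nan (inf divided by inf), so the output is no longer a probability distribution.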

4.2 The solution

Numerical stability can be improved by reducing the size of the exponent.
Recall
$$e^{a + b} = e^a e^b$$
If b has the opposite sign of a, this reduces the size of the exponent. Specifically, if you multiply the softmax by a fraction equal to 1:
$$a_j = \frac{e^{z_j}}{\sum_{i=1}^{N} e^{z_i}} \frac{e^{-b}}{e^{-b}}$$
the exponent is reduced and the value of the softmax does not change. If b in $e^b$ is the largest of the $z_j$'s, $\max_j(\mathbf{z})$, every exponent becomes zero or negative, so no term can overflow:
$$\begin{aligned} a_j &= \frac{e^{z_j}}{\sum_{i=1}^{N} e^{z_i}} \frac{e^{-\max_j(\mathbf{z})}}{e^{-\max_j(\mathbf{z})}} \\ &= \frac{e^{z_j-\max_j(\mathbf{z})}}{\sum_{i=1}^{N} e^{z_i-\max_j(\mathbf{z})}} \end{aligned}$$
It is customary to write $C=\max_j(\mathbf{z})$, since the equation is correct for any constant C.

$$a_j = \frac{e^{z_j-C}}{\sum_{i=1}^{N} e^{z_i-C}} \quad\quad\text{where}\quad C=\max_j(\mathbf{z})\tag{5}$$
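The claim that (5) holds for any constant C is easy to check numerically. This is a small sketch with made-up inputs that are small enough for the unshifted softmax to work, so that all versions can be compared; `softmax` is a local helper, not a library call.

```python
import numpy as np

def softmax(z):
    ez = np.exp(z)
    return ez / np.sum(ez)

# Assumed small inputs, so the unshifted version does not overflow.
z = np.array([1.0, 2.0, 3.0, 4.0])

unshifted = softmax(z)
shifted_by_max = softmax(z - np.max(z))     # C = max_j(z), as in equation (5)
shifted_by_other = softmax(z - 2.5)         # any other constant gives the same result
print(unshifted)
print(shifted_by_max)
print(shifted_by_other)
```

Choosing C as the maximum is what makes the shift useful in practice: the largest exponent becomes exactly 0, so `np.exp` never sees a positive argument.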

If we look at our troublesome example where $\mathbf{z}$ contains 500, 600, 700, 800, then $C=\max_j(\mathbf{z})=800$:

$$\mathbf{a}(x)=\frac{1}{e^{500-800}+e^{600-800}+e^{700-800}+e^{800-800}}\begin{bmatrix}e^{500-800}\\ e^{600-800}\\ e^{700-800}\\ e^{800-800}\end{bmatrix}=\begin{bmatrix}5.15e-131\\ 1.38e-87\\ 3.72e-44\\ 1.0\end{bmatrix}$$

The softmax with improved numerical stability:

def my_softmax_ns(z):
    """softmax with improved numerical stability"""
    bigz = np.max(z)
    ez = np.exp(z-bigz)              # shift by the max so the largest exponent is 0
    sm = ez/np.sum(ez)
    return sm

Calling it:

z_tmp = np.array([500.,600,700,800])
print(tf.nn.softmax(z_tmp).numpy(), "\n", my_softmax_ns(z_tmp))

[5.15e-131 1.38e-087 3.72e-044 1.00e+000] 
 [5.15e-131 1.38e-087 3.72e-044 1.00e+000]

4.3 Stability of the cross-entropy loss

The loss function associated with softmax, the cross-entropy loss, is repeated here:
$$L(\mathbf{a},y)=\begin{cases} -\log(a_1), & \text{if } y=1\\ \quad\vdots\\ -\log(a_N), & \text{if } y=N \end{cases}$$

where y is the target category for this example and $\mathbf{a}$ is the output of a softmax function. In particular, the values in $\mathbf{a}$ are probabilities that sum to one.
Let's consider a case where the target is two ($y=2$) and just look at the loss for that case. The loss is then:
$$L(\mathbf{a})= -\log(a_2)$$

Recall that $a_2$ is the output of the softmax function described above, so this can be written:
$$L(\mathbf{z})= -\log\left(\frac{e^{z_2}}{\sum_{i=1}^{N} e^{z_i}}\right) \tag{6}$$
This can be optimized. However, to make those optimizations, the softmax and the loss must be calculated together, as in the "preferred" TensorFlow implementation you saw above.

Starting from (6) above, the loss for the case of y=2: since $\log(\frac{a}{b}) = \log(a) - \log(b)$, (6) can be rewritten:
$$L(\mathbf{z})= -\left[\log(e^{z_2}) - \log \sum_{i=1}^{N} e^{z_i}\right] \tag{7}$$
The first term can be simplified to just $z_2$:
$$L(\mathbf{z})= -\left[z_2 - \log\left(\sum_{i=1}^{N} e^{z_i}\right)\right] = \underbrace{\log \sum_{i=1}^{N} e^{z_i}}_\text{logsumexp()} - z_2 \tag{8}$$
It turns out that the $\log \sum_{i=1}^{N} e^{z_i}$ term in the above equation is used so often that many libraries have an implementation. In TensorFlow this is tf.math.reduce_logsumexp(). An issue with this sum is that the exponential in the sum can overflow if $z_i$ is large. To fix this, we would like to subtract $\max_j(\mathbf{z})$ from each exponent as we did above, but this requires a bit of work:
$$\begin{aligned} \log \sum_{i=1}^{N} e^{z_i} &= \log \sum_{i=1}^{N} e^{(z_i - \max_j(\mathbf{z}) + \max_j(\mathbf{z}))} \\ &= \log \sum_{i=1}^{N} e^{(z_i - \max_j(\mathbf{z}))} e^{\max_j(\mathbf{z})} \\ &= \log(e^{\max_j(\mathbf{z})}) + \log \sum_{i=1}^{N} e^{(z_i - \max_j(\mathbf{z}))} \\ &= \max_j(\mathbf{z}) + \log \sum_{i=1}^{N} e^{(z_i - \max_j(\mathbf{z}))} \end{aligned} \tag{9}$$
Now the exponential is less likely to overflow. It is customary to write $C=\max_j(\mathbf{z})$, since the equation is correct for any constant C. We can now write the loss equation:

$$L(\mathbf{z})= C+ \log\left(\sum_{i=1}^{N} e^{z_i-C}\right) - z_2 \;\;\;\text{where } C=\max_j(\mathbf{z}) \tag{10}$$
This is a computationally simpler, more stable version of the loss. The above is for an example where the target is y=2, but it generalizes to any target.
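The difference between (6) and (10) shows up as soon as the inputs are large. The sketch below implements both in NumPy; the function names `loss_naive` and `loss_stable` are assumptions for this illustration, the target is given as a 0-based index (so the second entry is index 1), and this is not how TensorFlow implements its loss internally.

```python
import numpy as np

def loss_naive(z, y):
    """Equation (6): softmax first, then -log of the target entry."""
    ez = np.exp(z)
    return -np.log(ez[y] / np.sum(ez))

def loss_stable(z, y):
    """Equation (10): C + log(sum(exp(z - C))) - z_y with C = max(z)."""
    C = np.max(z)
    return C + np.log(np.sum(np.exp(z - C))) - z[y]

z_small = np.array([1.0, 2.0, 3.0, 4.0])
z_large = np.array([500.0, 600.0, 700.0, 800.0])
y = 1                                        # 0-based index of the target

naive_small = loss_naive(z_small, y)
stable_small = loss_stable(z_small, y)
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive_large = loss_naive(z_large, y)     # overflow makes this unusable
stable_large = loss_stable(z_large, y)       # stays finite

print(naive_small, stable_small)             # agree on small inputs
print(naive_large, stable_large)
```

For the large inputs the stable version returns 200.0, which is simply 800 - 600: the log-sum term is dominated by the largest logit, so the loss reduces to the gap between the largest logit and the target's logit.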

5. Quiz questions

Note that in the second approach the final output layer uses a linear activation (equivalent to no activation).

This is done with the Adam optimizer.

In a convolutional neural network, a single node reuses multiple input values.

KiraFenvy

[Machine Learning] 18. The Softmax Function
