Problem description
In an LSTM network (Understanding LSTMs), why do the input gate and output gate use tanh?
What is the intuition behind this?
Is it just a nonlinear transformation? If it is, can I change both to another activation function (e.g., ReLU)?
Recommended answer
Sigmoid, specifically, is used as the gating function for the three gates (input, output, and forget) in an LSTM, since it outputs a value between 0 and 1 and can therefore let either no flow or complete flow of information through a gate.
On the other hand, to overcome the vanishing gradient problem, we need a function whose second derivative can sustain for a long range before going to zero. Tanh is a good function with that property.
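To make the division of labor concrete, here is a minimal sketch of a single LSTM time step in NumPy (the weight dictionaries `W`, `U`, `b` and the `activation` parameter are my own illustration, not part of the original answer): sigmoid squashes each gate into (0, 1), while tanh is applied to the candidate cell state and the output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, W, U, b, activation=np.tanh):
    """One LSTM time step; W, U, b are dicts keyed by 'i', 'f', 'o', 'g'."""
    # Gates: sigmoid maps each entry into (0, 1), i.e. "how much flows through".
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    # Candidate cell state: tanh keeps the values in (-1, 1).
    g = activation(W['g'] @ x + U['g'] @ h_prev + b['g'])
    c = f * c_prev + i * g        # new cell state
    h = o * activation(c)         # new hidden state
    return h, c
```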
A good neuron unit should be bounded, easily differentiable, monotonic (good for convex optimization), and easy to handle. If you consider these qualities, then I believe you can use ReLU in place of the tanh function, since they are very good alternatives to each other.
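If you want to experiment with that substitution, the cell sketch above takes the activation as a parameter, so swapping tanh for ReLU is one line (again, only an illustration; standard LSTM implementations usually hard-code tanh):

```python
relu = lambda z: np.maximum(z, 0.0)
h, c = lstm_cell_step(x, h_prev, c_prev, W, U, b, activation=relu)
```

Keep in mind that ReLU is unbounded, so the cell state is no longer squashed into (-1, 1).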
But before choosing an activation function, you must know the advantages and disadvantages of your choice compared with the others. I briefly describe some of the activation functions and their advantages below.
Sigmoid
Mathematical expression: sigmoid(z) = 1/(1 + exp(-z))
First-order derivative: sigmoid'(z) = exp(-z)/(1 + exp(-z))^2 = sigmoid(z)(1 - sigmoid(z))
Advantages:
(1) The sigmoid function has all the fundamental properties of a good activation function.
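As a quick numerical check of the expression and its derivative, here is a small NumPy sketch (the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # exp(-z)/(1 + exp(-z))^2 simplifies to sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5, 5, 11)
print(sigmoid(z))       # values stay in (0, 1)
print(sigmoid_grad(z))  # peaks at 0.25 when z = 0
```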
Tanh
Mathematical expression: tanh(z) = [exp(z) - exp(-z)]/[exp(z) + exp(-z)]
First-order derivative: tanh'(z) = 1 - ([exp(z) - exp(-z)]/[exp(z) + exp(-z)])^2 = 1 - tanh^2(z)
Advantages:
(1) Often found to converge faster in practice
(2) Gradient computation is less expensive
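The same kind of sketch for tanh, using NumPy's built-in np.tanh:

```python
import numpy as np

def tanh_grad(z):
    # 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

z = np.linspace(-5, 5, 11)
print(np.tanh(z))    # values stay in (-1, 1) and are zero-centered
print(tanh_grad(z))  # peaks at 1.0 when z = 0 (vs. 0.25 for sigmoid)
```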
Hard Tanh
Mathematical expression: hardtanh(z) = -1 if z < -1; z if -1 <= z <= 1; 1 if z > 1
First-order derivative: hardtanh'(z) = 1 if -1 <= z <= 1; 0 otherwise
Advantages:
(1) Computationally cheaper than Tanh
(2) Saturates for magnitudes of z greater than 1
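A piecewise implementation of the expression above (a sketch; np.clip does the clamping):

```python
import numpy as np

def hardtanh(z):
    # -1 if z < -1;  z if -1 <= z <= 1;  1 if z > 1
    return np.clip(z, -1.0, 1.0)

def hardtanh_grad(z):
    # 1 inside [-1, 1], 0 outside
    return np.where((z >= -1.0) & (z <= 1.0), 1.0, 0.0)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(hardtanh(z))       # [-1.  -0.5  0.5  1. ]
print(hardtanh_grad(z))  # [ 0.   1.   1.   0. ]
```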
ReLU
Mathematical expression: relu(z) = max(z, 0)
First-order derivative: relu'(z) = 1 if z > 0; 0 otherwise
Advantages:
(1) Does not saturate even for large values of z
(2) Found much success in computer vision applications
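In code the function and its derivative are one-liners (a sketch; the derivative at z = 0 is taken as 0 here, which is the usual convention):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # 1 if z > 0; 0 otherwise
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))       # [0. 0. 3.]
print(relu_grad(z))  # [0. 0. 1.]
```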
Leaky ReLU
Mathematical expression: leaky(z) = max(z, k·z) where 0 < k < 1
First-order derivative: leaky'(z) = 1 if z > 0; k otherwise
Advantages:
(1) Allows propagation of error for non-positive z which ReLU doesn't
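A sketch with a concrete slope (k = 0.01 is just a common choice for illustration, not something specified above):

```python
import numpy as np

def leaky_relu(z, k=0.01):
    # max(z, k*z) with 0 < k < 1
    return np.maximum(z, k * z)

def leaky_relu_grad(z, k=0.01):
    # 1 if z > 0; k otherwise, so some gradient still flows for non-positive z
    return np.where(z > 0, 1.0, k)

z = np.array([-2.0, 3.0])
print(leaky_relu(z))       # [-0.02  3.  ]
print(leaky_relu_grad(z))  # [ 0.01  1.  ]
```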
This paper explains some interesting activation functions; you may consider reading it.