Any non-zero recurrent_dropout yields NaN losses and weights; the latter are either 0 or NaN. It happens for stacked, shallow, stateful, return_sequences = any, with & w/o Bidirectional(), activation='relu', loss='binary_crossentropy'. NaNs occur within a few batches.
Any fixes? Help's appreciated.
TROUBLESHOOTING ATTEMPTED:
- recurrent_dropout=0.2, 0.1, 0.01, 1e-6
- kernel_constraint=maxnorm(0.5, axis=0)
- recurrent_constraint=maxnorm(0.5, axis=0)
- clipnorm=50 (empirically determined), Nadam optimizer
- activation='tanh' - no NaNs, weights stable, tested for up to 10 batches
- lr=2e-6, 2e-5 - no NaNs, weights stable, tested for up to 10 batches
- lr=5e-5 - no NaNs, weights stable for 3 batches - NaNs on batch 4
- batch_shape=(32,48,16) - large loss for 2 batches, NaNs on batch 3

NOTE: batch_shape=(32,672,16), 17 calls to train_on_batch per batch
ENVIRONMENT:
- Keras 2.2.4 (TensorFlow backend), Python 3.7, Spyder 3.3.7 via Anaconda
- GTX 1070 6GB, i7-7700HQ, 12GB RAM, Win-10.0.17134 x64
- CuDNN 10+, latest Nvidia drivers
ADDITIONAL INFO:
Model divergence is spontaneous, occurring at different train updates even with fixed seeds (Numpy, Random, and TensorFlow random seeds). Furthermore, when divergence first occurs, the LSTM layer weights are all still normal - they only become NaN later.
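For reference, a minimal sketch of the kind of seed-fixing used here (standard calls for a Keras 2.2.4 / TF1 backend; the exact seed values are arbitrary):

import numpy as np
import random
import tensorflow as tf

# Fix all three RNG sources before building the model; divergence still
# shows up at different train updates, so it is not seed-dependent.
np.random.seed(1)
random.seed(2)
tf.set_random_seed(3)   # tf.random.set_seed(3) in TF2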
Below are, in order: (1) inputs to LSTM; (2) LSTM outputs; (3) Dense(1,'sigmoid') outputs -- the three are consecutive, with Dropout(0.5) between each. Preceding (1) are Conv1D layers. Right: LSTM weights. "BEFORE" = 1 train update before; "AFTER" = 1 train update after.
BEFORE divergence:
AT divergence:
## LSTM outputs, flattened, stats
(mean,std) = (inf,nan)
(min,max) = (0.00e+00,inf)
(abs_min,abs_max) = (0.00e+00,inf)
AFTER divergence:
## Recurrent Gates Weights:
array([[nan, nan, nan, ..., nan, nan, nan],
[ 0., 0., -0., ..., -0., 0., 0.],
[ 0., -0., -0., ..., -0., 0., 0.],
...,
[nan, nan, nan, ..., nan, nan, nan],
[ 0., 0., -0., ..., -0., 0., -0.],
[ 0., 0., -0., ..., -0., 0., 0.]], dtype=float32)
## Dense Sigmoid Outputs:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)
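For context, a rough sketch of how stats and weights like the above can be pulled for inspection (a Keras backend function over an intermediate output; the layer index is illustrative and depends on the actual model):

import numpy as np
from keras import backend as K

# Fetch the LSTM's output for a batch `x` from the compiled `model`
# (layers[1] is the LSTM in the minimal example below; adjust as needed).
get_lstm_out = K.function([model.input, K.learning_phase()],
                          [model.layers[1].output])
lstm_out = get_lstm_out([x, 0])[0].flatten()   # 0 = test mode, 1 = train mode

print("(mean,std)        =", lstm_out.mean(), lstm_out.std())
print("(min,max)         =", lstm_out.min(), lstm_out.max())
print("(abs_min,abs_max) =", np.abs(lstm_out).min(), np.abs(lstm_out).max())

# LSTM weights are [kernel, recurrent_kernel, bias]; index 1 is the
# recurrent kernel (the "Recurrent Gates Weights" above).
W_rec = model.layers[1].get_weights()[1]
print("NaNs in recurrent kernel:", np.isnan(W_rec).any())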
MINIMAL REPRODUCIBLE EXAMPLE:
from keras.layers import Input, Dense, LSTM, Dropout
from keras.models import Model
from keras.optimizers import Nadam
from keras.constraints import MaxNorm as maxnorm
import numpy as np

ipt = Input(batch_shape=(32, 672, 16))
x   = LSTM(512, activation='relu', return_sequences=False,
           recurrent_dropout=0.3,
           kernel_constraint=maxnorm(0.5, axis=0),
           recurrent_constraint=maxnorm(0.5, axis=0))(ipt)
out = Dense(1, activation='sigmoid')(x)

model     = Model(ipt, out)
optimizer = Nadam(lr=4e-4, clipnorm=1)
model.compile(optimizer=optimizer, loss='binary_crossentropy')

for train_update, _ in enumerate(range(100)):
    x = np.random.randn(32, 672, 16)     # random inputs
    y = np.array([1]*5 + [0]*27)         # 5 ones, 27 zeros
    np.random.shuffle(y)
    loss = model.train_on_batch(x, y)
    print(train_update + 1, loss, np.sum(y))
Observations: the following speed up divergence:
- Higher units (LSTM)
- Higher # of layers (LSTM)
- Higher lr << no divergence when <= 1e-4, tested up to 400 trains
- Less '1' labels << no divergence with y below, even with lr=1e-3; tested up to 400 trains

y = np.random.randint(0,2,32) # makes more '1' labels
UPDATE: not fixed in TF2; also reproducible using from tensorflow.keras imports.
Studying the LSTM formulae more deeply and digging into the source code, everything became crystal clear.

Verdict: recurrent_dropout has nothing to do with it; a thing's being looped where none expect it.

Actual culprit: the activation argument, here 'relu', which is applied to the recurrent transformations - contrary to virtually every tutorial showing it as the harmless 'tanh'.

I.e., activation is not only for the hidden-to-output transform - see the source code; it operates directly on computing both recurrent states, cell and hidden:
c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c, self.recurrent_kernel_c))
h = o * self.activation(c)
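To see why that is a problem, here is a toy numpy sketch of the same recursion (the gate values and recurrent weight are arbitrary placeholders, not taken from a real model) - with relu the cell state compounds geometrically, with tanh it stays bounded:

import numpy as np

def simulate_cell(activation, timesteps=200, seed=0):
    # Toy version of: c = f*c_tm1 + i*act(x_c + U_c.h_tm1);  h = o*act(c)
    rng = np.random.RandomState(seed)
    c = h = 0.0
    f, i, o = 0.9, 0.5, 0.5    # gates held fixed for illustration
    u = 1.2                    # recurrent weight
    for _ in range(timesteps):
        c = f * c + i * activation(rng.randn() + u * h)
        h = o * activation(c)
    return c, h

relu = lambda z: np.maximum(z, 0.0)
print("relu:", simulate_cell(relu))     # grows ~geometrically (would overflow float32 over 672 timesteps)
print("tanh:", simulate_cell(np.tanh))  # stays bounded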
Solution(s):
- Apply BatchNormalization to the LSTM's inputs, especially if the previous layer's outputs are unbounded (ReLU, ELU, etc.) - see the sketch after this list
- If the previous layer's activations are tightly bounded (e.g. tanh, sigmoid), apply BN before the activations (use activation=None, then BN, then an Activation layer)
- Use activation='selu'; more stable, but can still diverge
- Use a lower lr
- Apply gradient clipping
- Use fewer timesteps
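As a concrete illustration of the first two bullets, a minimal sketch of the reproducible example rebuilt with BN in front of the LSTM and a bounded recurrent activation (the Conv1D filters/kernel size and the lr here are placeholders, not tuned values):

from keras.layers import Input, Conv1D, BatchNormalization, Activation, Dense, LSTM
from keras.models import Model
from keras.optimizers import Nadam

ipt = Input(batch_shape=(32, 672, 16))

# Previous layer with activation=None -> BN -> Activation, so the LSTM
# receives normalized, bounded inputs.
x = Conv1D(64, 8, padding='same', activation=None)(ipt)
x = BatchNormalization()(x)
x = Activation('relu')(x)

# Bounded activation inside the LSTM instead of 'relu'.
x = LSTM(512, activation='tanh', recurrent_dropout=0.3,
         return_sequences=False)(x)
out = Dense(1, activation='sigmoid')(x)

model = Model(ipt, out)
model.compile(optimizer=Nadam(lr=1e-4, clipnorm=1),  # lower lr + gradient clipping
              loss='binary_crossentropy')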
More answers, to some remaining questions:
- Why was recurrent_dropout suspected? Unmeticulous testing setup; only now did I focus on forcing divergence without it. It did, however, sometimes accelerate divergence - which may be explained by it zeroing the non-relu contributions that would otherwise offset multiplicative reinforcement.
- Why do nonzero-mean inputs accelerate divergence? Additive symmetry; nonzero-mean distributions are asymmetric, with one sign dominating - facilitating large pre-activations, hence large ReLUs.
- Why can training be stable for hundreds of iterations with a low lr? Extreme activations induce large gradients via large error; with a low lr, this means weights adjust to prevent such activations - whereas a high lr jumps too far too quickly.
- Why do stacked LSTMs diverge faster? In addition to feeding ReLUs to itself, each LSTM feeds the next LSTM, which then feeds itself the ReLU'd ReLUs --> fireworks.
UPDATE 1/22/2020: recurrent_dropout may in fact be a contributing factor, as it utilizes inverted dropout, upscaling hidden transformations during training, easing divergent behavior over many timesteps. Git Issue on this here.
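A small numpy sketch of that inverted-dropout upscaling (rate and shapes are arbitrary; this is the generic mechanism, not Keras's internal code):

import numpy as np

rate = 0.3
h = np.random.randn(32, 512)                 # a hidden-state-like tensor
mask = np.random.rand(*h.shape) >= rate      # keep ~70% of the units

# Inverted dropout: surviving units are scaled by 1/(1 - rate) at train time,
# so kept activations are ~1.43x larger - amplifying the relu'd recurrent
# transform at every timestep.
h_dropped = h * mask / (1.0 - rate)
print(np.abs(h_dropped[mask]).mean() / np.abs(h[mask]).mean())   # ~1.43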