RNN Regularization: Which Component to Regularize?


Question

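I am building an RNN for classification (there is a softmax layer after the RNN). There are many options for what to regularize, and I am not sure whether to just try all of them. Would the effect be the same? Which components should I regularize, and in what situation?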

The components are:

Kernel weights (layer input)
Recurrent weights
Bias
Activation function (layer output)
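To make the four components concrete, here is a minimal sketch of one vanilla RNN step in plain Python, with a separate L2 penalty computed for each component. The sizes, weight values, and the single shared coefficient `lam` are made up for illustration. In Keras recurrent layers, these four penalties correspond to the `kernel_regularizer`, `recurrent_regularizer`, `bias_regularizer`, and `activity_regularizer` arguments.

```python
import math

def l2(values):
    """Sum of squares of a flat list of numbers."""
    return sum(v * v for v in values)

def flatten(matrix):
    return [v for row in matrix for v in row]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def rnn_step(x, h_prev, W, U, b):
    """One vanilla RNN step: h = tanh(W x + U h_prev + b)."""
    wx = matvec(W, x)        # kernel weights act on the layer input
    uh = matvec(U, h_prev)   # recurrent weights act on the previous hidden state
    return [math.tanh(a + c + d) for a, c, d in zip(wx, uh, b)]

# Toy sizes (hypothetical): 2 input features, 3 hidden units.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]                # kernel
U = [[0.1, 0.0, 0.2], [0.0, 0.1, 0.0], [0.3, 0.0, 0.1]]   # recurrent
b = [0.0, 0.1, -0.1]                                      # bias
h = rnn_step([1.0, 2.0], [0.0, 0.0, 0.0], W, U, b)        # layer output

lam = 1e-2  # one coefficient for all four penalties, purely for illustration
penalties = {
    "kernel": lam * l2(flatten(W)),
    "recurrent": lam * l2(flatten(U)),
    "bias": lam * l2(b),
    "activity": lam * l2(h),
}
total_penalty = sum(penalties.values())  # would be added to the training loss
```

In practice each component would usually get its own coefficient, since they affect the model differently.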

Answer

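The regularizers that work best will depend on your specific architecture, data, and problem; as usual, there isn't a single cut to rule them all, but there are do's and (especially) don'ts, as well as systematic means of determining what will work best - via careful introspection and evaluation.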

cannot set use_bias=False as an "equivalent"; BN applies to outputs, not hidden-to-hidden transforms.

Zoneout: don't know, never tried, might work - see paper.

Layer Normalization: some report it working better than BN for RNNs - but my application found it otherwise; paper

Data shuffling: is a strong regularizer. Also shuffle batch samples (samples in batch). See relevant info on stateful RNNs
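A minimal sketch of the shuffling idea, assuming stateless training where samples are independent: one fresh permutation per epoch changes both the batch order and which samples share a batch. The helper name `shuffled_batches` and the toy data are hypothetical.

```python
import random

def shuffled_batches(samples, batch_size, seed=None):
    """Yield batches drawn from a fresh permutation of the samples."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)  # new order, so batch membership changes too
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]

samples = list(range(10))  # stand-ins for (sequence, label) pairs
epoch_1 = list(shuffled_batches(samples, batch_size=4, seed=0))
epoch_2 = list(shuffled_batches(samples, batch_size=4, seed=1))
```

For stateful RNNs this naive version is not safe as-is: sample i of one batch has to continue sample i of the previous batch, so any shuffling must keep those sequences aligned.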

Optimizer: can be an inherent regularizer. Don't have a full explanation, but in my application, Nadam (& NadamW) has stomped every other optimizer - worth trying.

Introspection: bottom section on 'learning' isn't worth much without this; don't just look at validation performance and call it a day - inspect the effect that adjusting a regularizer has on weights and activations. Evaluate using info toward bottom & relevant theory.

BONUS: weight decay can be powerful - even more powerful when done right; turns out, adaptive optimizers like Adam can harm its effectiveness, as described in this paper. Solution: use AdamW. My Keras/TensorFlow implementation here.
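To see why an L2 penalty inside Adam differs from decoupled weight decay, here is a scalar sketch (not the linked implementation). It follows the standard Adam update with a constant learning rate; `l2` and `decoupled_wd` are illustrative parameter names. With a zero loss gradient, the L2 variant shrinks a small and a large weight by about the same absolute amount, because the penalty gradient is rescaled by the adaptive term, while the decoupled variant shrinks both proportionally.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam step on a scalar weight, with optional L2 or decoupled decay."""
    g = grad + l2 * w                      # L2: penalty is folded into the gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)              # bias-corrected second moment
    adaptive = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled (AdamW-style): decay acts on the weight, outside the adaptive scaling.
    w = w - lr * (adaptive + decoupled_wd * w)
    return w, m, v

# Zero loss gradient isolates the effect of the regularizer.
l2_small, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=0.01)
l2_large, _, _ = adam_step(100.0, 0.0, 0.0, 0.0, t=1, l2=0.01)
wd_small, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled_wd=0.01)
wd_large, _, _ = adam_step(100.0, 0.0, 0.0, 0.0, t=1, decoupled_wd=0.01)

print("L2 in Adam, absolute shrink:", 1.0 - l2_small, 100.0 - l2_large)
print("Decoupled, relative shrink: ", (1.0 - wd_small) / 1.0, (100.0 - wd_large) / 100.0)
```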

This is too much! Agreed - welcome to Deep Learning. Two tips here:

Bayesian Optimization; will save you time especially on prohibitively expensive training.
Conv1D(strides > 1), for many timesteps (>1000); slashes dimensionality, shouldn't harm performance (may in fact improve it).
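As a rough illustration of how a strided convolution cuts the number of timesteps before the RNN, here is a plain-Python sketch of a single-channel "valid" convolution. The fixed averaging kernel is a stand-in for illustration; in Keras the filters of `Conv1D(filters, kernel_size, strides=...)` are learned.

```python
def conv1d_strided(sequence, kernel, stride):
    """'Valid' 1D convolution (cross-correlation) over a list of floats."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, sequence[start:start + k]))
        for start in range(0, len(sequence) - k + 1, stride)
    ]

timesteps = 1200
signal = [float(t % 10) for t in range(timesteps)]  # toy input, one channel
kernel = [0.25, 0.25, 0.25, 0.25]                   # hypothetical averaging filter
reduced = conv1d_strided(signal, kernel, stride=4)

# Output length for 'valid' padding: (timesteps - kernel_size) // stride + 1
print(len(signal), "->", len(reduced))
```

With a kernel size and stride of 4, the RNN sees a quarter of the original timesteps, which shortens the path gradients have to travel.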

Introspection Code:

Gradients: see this answer

Weights: see this answer

Weight norm tracking: see this Q & A

Activations: see this answer

Weights: see_rnn.rnn_histogram or see_rnn.rnn_heatmap (examples in README)
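If the linked code is not at hand, weight norm tracking can be sketched in a few lines of plain Python. This is only an illustration, not the code from the linked Q & A; the dictionary layout and the names `track_norms` and `snapshots` are assumptions, and in a real run the snapshots would come from the model's weights after each epoch or batch.

```python
import math

def l2_norm(matrix):
    """L2 (Frobenius) norm of a weight matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

def track_norms(history, weights):
    """Append the current norm of every named weight matrix to its history."""
    for name, matrix in weights.items():
        history.setdefault(name, []).append(l2_norm(matrix))
    return history

# Hypothetical snapshots taken after two consecutive epochs.
snapshots = [
    {"kernel": [[3.0, 4.0]], "recurrent": [[1.0, 0.0], [0.0, 1.0]]},
    {"kernel": [[6.0, 8.0]], "recurrent": [[1.0, 0.0], [0.0, 1.0]]},
]

history = {}
for weights in snapshots:
    track_norms(history, weights)

for name, norms in history.items():
    print(name, norms)
```

A norm that keeps growing for one matrix while the others stay flat is a hint about which component a weight penalty should target.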

How does 'learning' work?

The 'ultimate truth' of machine learning that is seldom discussed or emphasized is, we don't have access to the function we're trying to optimize - the test loss function. All of our work is with what are approximations of the true loss surface - both the train set and the validation set. This has some critical implications:

Train set global optimum can lie very far from test set global optimum
Local optima are unimportant, and irrelevant:
- Train set local optimum is almost always a better test set optimum
- Actual local optima are almost impossible for high-dimensional problems: the gradients w.r.t. all of the millions of parameters would need to equal zero at once, and the curvature would need the same sign along every direction; with mixed signs the point is a "saddle" instead
- Local attractors are a lot more relevant; the analogy then shifts from "falling into a pit" to "gravitating into a strong field"; once in that field, your loss surface topology is bound to that set up by the field, which defines its own local optima; a high LR can help exit a field, much like "escape velocity"

Further, loss functions are way too complex to analyze directly; a better approach is to localize analysis to individual layers, their weight matrices, and roles relative to the entire NN. Two key considerations are:

Feature extraction capability. Ex: the driving mechanism of deep classifiers is, given input data, to increase class separability with each layer's transformation. Higher quality features will filter out irrelevant information, and deliver what's essential for the output layer (e.g. softmax) to learn a separating hyperplane.
Information utility. Dead neurons, and extreme activations are major culprits of poor information utility; no single neuron should dominate information transfer, and too many neurons shouldn't lie purposeless. Stable activations and weight distributions enable gradient propagation and continued learning.
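One simple way to put numbers on "information utility" is to count units that never fire and units that are always pinned at the extremes. The sketch below assumes tanh-like activations in [-1, 1]; the thresholds `dead_tol` and `saturation` and the toy batch are arbitrary choices for illustration.

```python
def activation_health(activations, dead_tol=1e-6, saturation=0.99):
    """activations: list of samples, each a list of per-unit outputs in [-1, 1]."""
    n_units = len(activations[0])
    dead, saturated = 0, 0
    for u in range(n_units):
        column = [sample[u] for sample in activations]
        if all(abs(a) <= dead_tol for a in column):
            dead += 1         # unit outputs ~0 for every sample
        elif all(abs(a) >= saturation for a in column):
            saturated += 1    # unit is pinned near -1 or +1 for every sample
    return {
        "dead_fraction": dead / n_units,
        "saturated_fraction": saturated / n_units,
    }

# Toy batch: 3 samples, 4 units (unit 0 is dead, unit 1 is saturated).
batch = [
    [0.0,  1.0,  0.3, -0.5],
    [0.0, -1.0, -0.2,  0.7],
    [0.0,  1.0,  0.6,  0.1],
]
report = activation_health(batch)
print(report)
```

Tracking these fractions while changing a regularizer shows whether it moves the layer toward more evenly used units or away from it.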

How does regularization work? (read above first)

In a nutshell, via maximizing NN's information utility, and improving estimates of the test loss function. Each regularization method is unique, and no two exactly alike - see "RNN regularizers".

RNN: Depth vs. Width: not as simple as "one is more nonlinear, other works in higher dimensions".

RNN width is defined by (1) # of input channels; (2) # of cell's filters (output channels). As with CNN, each RNN filter is an independent feature extractor: more is suited for higher-complexity information, including but not limited to: dimensionality, modality, noise, frequency.
RNN depth is defined by (1) # of stacked layers; (2) # of timesteps. Specifics will vary by architecture, but from an information standpoint, unlike CNNs, RNNs are dense: every timestep influences the ultimate output of a layer, hence the ultimate output of the next layer - so it again isn't as simple as "more nonlinearity"; stacked RNNs exploit both spatial and temporal information.

Update:

Here is an example of a near-ideal RNN gradient propagation for 170+ timesteps:

This is rare, and was achieved via careful regularization, normalization, and hyperparameter tuning. Usually we see a large gradient for the last few timesteps, which drops off sharply toward the left - as here. Also, since the model is stateful and fits 7 equivalent windows, the gradient effectively spans roughly 1200 timesteps (7 windows of 170+ timesteps each).

Update 2: see 9 w/ new info & correction

Update 3: add weight norms & weights introspection code
