Question
TensorFlow's RNN tutorial (https://www.tensorflow.org/tutorials/recurrent) mentions two parameters: batch size and time steps. I am confused by these concepts. As I understand it, RNNs introduce batches because the training sequence can be so long that backpropagation cannot be computed over it (exploding/vanishing gradients). So we divide the long training sequence into shorter sequences, each of which is a mini-batch whose size is called the "batch size". Am I right here?
Regarding time steps, an RNN consists of only one cell (an LSTM cell, a GRU cell, or some other cell), and this cell is sequential. We can understand the sequential concept by unrolling it. But unrolling a sequential cell is a concept, not something real, which means we do not implement it in an unrolled way. Suppose the training sequence is a text corpus. Then we feed one word at a time to the RNN cell and then update the weights. So why do we have time steps here? Combined with my understanding of "batch size" above, I am even more confused. Do we feed the cell one word or multiple words (batch size)?
Recommended answer
Batch size is the number of training samples considered at a time when updating your network weights. So, in a feedforward network, if you update the weights based on gradients computed from one word at a time, your batch_size = 1. Because the gradients are computed from a single sample, each update is computationally very cheap. On the other hand, training is also very erratic.
To understand what happens during the training of such a feedforward network, I'll refer you to this very nice visual example of single_batch versus mini_batch versus single_sample training.
However, you also want to understand what happens with the num_steps variable, which is not the same as batch_size. As you might have noticed, so far I have referred only to feedforward networks. In a feedforward network, the output is determined by the network inputs, and the input-output relation is mapped by the learned network weights:
hidden_activations(t) = f(input(t))
output(t) = g(hidden_activations(t)) = g(f(input(t)))
After a training pass over batch_size samples, the gradient of your loss function with respect to each network parameter is computed and your weights are updated.
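To make the role of batch_size concrete, here is a minimal NumPy sketch of one weight update per mini-batch. It is only an illustration under stated assumptions: the model is a plain linear map with a squared-error loss rather than a real network, and the names minibatch_sgd_step, lr, and true_w are hypothetical.

```python
import numpy as np

def minibatch_sgd_step(w, x_batch, y_batch, lr=0.1):
    """One weight update computed from a whole batch of samples.

    Assumption: a linear model with mean squared error, chosen only
    to keep the gradient short. x_batch is [batch_size, n_features].
    """
    err = x_batch @ w - y_batch
    # Gradient of the mean squared error, averaged over the batch dimension.
    grad = 2.0 * x_batch.T @ err / x_batch.shape[0]
    return w - lr * grad

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))            # 64 training samples, 3 features each
true_w = np.array([1.0, -2.0, 0.5])     # hypothetical weights that generated y
y = x @ true_w

w = np.zeros(3)
batch_size = 8
num_updates = 0
for start in range(0, len(x), batch_size):
    stop = start + batch_size
    w = minibatch_sgd_step(w, x[start:stop], y[start:stop])
    num_updates += 1                    # one update per batch, not per sample
```

With 64 samples and batch_size = 8 the weights are updated 8 times per pass over the data; with batch_size = 1 there would be 64 cheaper but noisier updates.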
In a recurrent neural network (RNN), however, your network functions a tad differently:
hidden_activations(t) = f(input(t), hidden_activations(t-1))
output(t) = g(hidden_activations(t)) = g(f(input(t), hidden_activations(t-1)))
= g(f(input(t), f(input(t-1), hidden_activations(t-2)))) = g(f(input(t), f(input(t-1), ... , f(input(t=0), hidden_initial_state))))
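The recurrence above can be sketched in a few lines of NumPy. This is an assumption-laden illustration, not the tutorial's implementation: f is taken to be a tanh of a linear combination, g a linear readout, and the names rnn_forward, W_x, W_h, and W_o are hypothetical.

```python
import numpy as np

def rnn_forward(inputs, h0, W_x, W_h, W_o):
    """Apply the same cell at every time step, carrying the hidden state forward."""
    h = h0
    outputs = []
    for x_t in inputs:                       # inputs: [num_steps, input_dim]
        h = np.tanh(x_t @ W_x + h @ W_h)     # f(input(t), hidden_activations(t-1))
        outputs.append(h @ W_o)              # g(hidden_activations(t))
    return np.stack(outputs), h

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, num_steps = 4, 5, 2, 6
W_x = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_o = rng.normal(scale=0.5, size=(hidden_dim, output_dim))
inputs = rng.normal(size=(num_steps, input_dim))
h0 = np.zeros(hidden_dim)                    # hidden_initial_state

outputs, h_last = rnn_forward(inputs, h0, W_x, W_h, W_o)

# Perturb only the very first input and rerun: later outputs differ too,
# because each hidden state depends on all earlier inputs.
perturbed = inputs.copy()
perturbed[0] += 1.0
outputs_perturbed, _ = rnn_forward(perturbed, h0, W_x, W_h, W_o)
```

Note that there is a single set of weights; "unrolling" is just the loop over time steps reusing the same cell.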
As you might have surmised from the name, the network retains a memory of its previous state, and the neuron activations now also depend on the previous network state and, by extension, on all states the network has ever been in. Most RNNs employ a forgetfulness factor in order to attach more importance to more recent network states, but that is beside the point of your question.
As you might surmise, it is computationally very, very expensive to calculate the gradients of the loss function with respect to the network parameters if backpropagation has to go through all states since the creation of your network. There is a neat little trick to speed up the computation: approximate the gradients using only a subset of the historical network states, namely the most recent num_steps of them. This is commonly called truncated backpropagation through time.
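To show how batch_size and num_steps fit together for a text corpus, here is a rough NumPy sketch that cuts a sequence of word ids into windows of shape [batch_size, num_steps]. It assumes a next-word prediction setup where the targets are the inputs shifted by one position; the helper name make_windows and the toy corpus are hypothetical.

```python
import numpy as np

def make_windows(token_ids, batch_size, num_steps):
    """Yield (inputs, targets) pairs, each of shape [batch_size, num_steps]."""
    token_ids = np.asarray(token_ids)
    batch_len = len(token_ids) // batch_size
    # Split the corpus into batch_size parallel streams of equal length.
    data = token_ids[:batch_size * batch_len].reshape(batch_size, batch_len)
    # Each window covers num_steps tokens; targets are the inputs shifted by one.
    num_windows = (batch_len - 1) // num_steps
    for i in range(num_windows):
        start = i * num_steps
        x = data[:, start:start + num_steps]
        y = data[:, start + 1:start + num_steps + 1]
        yield x, y

corpus = np.arange(20)      # stand-in for a corpus of 20 word ids
windows = list(make_windows(corpus, batch_size=2, num_steps=3))
first_x, first_y = windows[0]
# first_x -> [[0, 1, 2], [10, 11, 12]]
# first_y -> [[1, 2, 3], [11, 12, 13]]
```

In this arrangement each row is an independent stream (that is the batch dimension), and each window spans num_steps consecutive words of that stream. Gradients are propagated back only within a window, while the hidden state at the end of one window can be passed as the initial state of the next.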
If this conceptual discussion was not clear enough, you can also take a look at a more mathematical description of the above.