Problem description
While tuning hyperparameters to make my model perform better, I noticed that the score I get (and hence the model that is created) is different every time I run the code, despite fixing all the seeds for random operations. This problem does not happen if I run on CPU.
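For context, "fixing all the seeds" here means the usual boilerplate along these lines (a minimal sketch, assuming TF 2.x; TF 1.x uses a slightly different call):

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy RNG
tf.random.set_seed(SEED)  # TF 2.x; TF 1.x uses tf.set_random_seed(SEED)
```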
I googled and found out that this is a common issue when using a GPU to train. Here is a very good, detailed example with short code snippets that verify the existence of the problem.
They pinpointed the non-determinism to the tf.reduce_sum function. However, that is not the case for me. It could be because I'm using different hardware (1080 TI) or a different version of the CUDA libraries or TensorFlow. It seems like many different parts of the CUDA libraries can be non-deterministic, and it doesn't seem easy to figure out exactly which part is responsible and how to get rid of it. Also, this must be by design, so there is likely enough of an efficiency gain to justify the non-determinism.
So, my question is:
Since GPUs are popular for training neural networks, people in this field must have a way to deal with non-determinism; I can't see how else you could reliably tune hyperparameters. What is the standard way to handle non-determinism when using a GPU?
Recommended answer
TL;DR
- Non-determinism in a priori deterministic operations comes from concurrent (multi-threaded) implementations.
- Despite constant progress on that front, TensorFlow does not currently guarantee determinism for all of its operations. After a quick search on the internet, the situation seems to be similar with the other major toolkits.
- During training, unless you are debugging an issue, it is OK to have fluctuations between runs. Uncertainty is in the nature of training, and it is wise to measure it and take it into account when comparing results, even once toolkits eventually reach perfect determinism in training.
That, but longer
When you look at neural network operations as mathematical operations, you would expect everything to be deterministic. Convolutions, activations, cross-entropy: everything here is a mathematical equation and should be deterministic. Even pseudo-random operations such as shuffling, dropout, noise and the like are entirely determined by a seed.
When you look at those operations from the point of view of their computational implementation, on the other hand, you see them as massively parallelized computations, which can be a source of randomness unless you are very careful.
The heart of the problem is that, when you run operations on several parallel threads, you typically do not know which thread will finish first. This does not matter when threads operate on their own data; applying an activation function to a tensor, for example, should be deterministic. But when those threads need to synchronize, such as when computing a sum, the result may depend on the order of the summation, and in turn on which thread finished first.
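A small, self-contained illustration of the summation-order effect, using plain NumPy rather than anything GPU-specific: floating-point addition is not associative, so accumulating the same values in a different order can yield a slightly different result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

s1 = np.sum(x)                    # one accumulation order
s2 = np.sum(rng.permutation(x))   # same values, different order

# The two sums typically differ in the last few bits, just as two GPU runs
# can differ when threads accumulate partial sums in a different order.
print(s1, s2, s1 == s2)
```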
From there, you have, broadly speaking, two options:
- Keep the non-determinism that comes with the simpler implementations.
- Take extra care in the design of your parallel algorithm to reduce or remove non-determinism from your computation. The added constraint usually results in slower algorithms.
Which route does CuDNN take? Well, mostly the deterministic one. In recent releases, deterministic operations are the norm rather than the exception. But it used to offer many non-deterministic operations and, more importantly, it used not to offer some operations, such as reductions, which people then had to implement themselves in CUDA with varying degrees of consideration for determinism.
Some libraries such as theano were further ahead on this topic, exposing early on a deterministic flag that the user could turn on or off; but as you can see from its description, it is far from offering any guarantee:
"If more, sometimes we will select some implementations that are more deterministic, but slower. In particular, on the GPU, we will avoid using AtomicAdd. Sometimes we will still use a non-deterministic implementation, e.g. when we do not have a deterministic GPU implementation. Also see the dnn.conv.algo* flags to cover more cases."
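For reference, a minimal sketch of how that flag is typically enabled; the only assumption is that you set it through THEANO_FLAGS, which must happen before Theano is first imported (it can also go in .theanorc).

```python
import os

# Must be set before the first `import theano`.
os.environ["THEANO_FLAGS"] = "deterministic=more"

import theano  # noqa: E402 - imported after setting the flag on purpose

print(theano.config.deterministic)  # should report 'more'
```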
In TensorFlow, the realization of the need for determinism came rather late, but it is slowly getting there, helped also by CuDNN's progress on that front. For a long time, reductions were non-deterministic, but now they seem to be deterministic. The fact that CuDNN introduced deterministic reductions in version 6.0 may of course have helped.
It seems that, currently, the main obstacle for TensorFlow on the way to determinism is the backward pass of the convolution. It is indeed one of the few operations for which CuDNN offers a non-deterministic algorithm, labeled CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0. This algorithm is still in the list of possible choices for the backward filter in TensorFlow, and since the choice of algorithm seems to be based on performance, it could indeed be picked if it is more efficient. (I am not so familiar with TensorFlow's C++ code, so take this with a grain of salt.)
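Depending on your TensorFlow version, there are switches that request deterministic kernels, including for the cuDNN convolution backward pass. A hedged sketch, not a guarantee: the environment variables below appeared around TF 1.14/2.1, and the explicit API only in TF 2.8+.

```python
import os

# These environment variables must be in place before TensorFlow sets up its
# GPU kernels (easiest: before importing it at all).
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"  # deterministic cuDNN conv/pooling
os.environ["TF_DETERMINISTIC_OPS"] = "1"    # broader op-level determinism

import tensorflow as tf  # noqa: E402

# TF 2.8+ also exposes an explicit API; ops without a deterministic
# implementation then raise an error instead of silently varying.
if hasattr(tf.config.experimental, "enable_op_determinism"):
    tf.config.experimental.enable_op_determinism()
```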
Does it matter?
If you are debugging an issue, determinism is not merely important: it is mandatory. You need to reproduce the steps that led to the problem. This is currently a real issue with toolkits like TensorFlow. To mitigate it, your only option is to debug live, adding checks and breakpoints at the right locations, which is not great.
Deployment is another aspect where deterministic behavior is often desirable, in part for human acceptance. While nobody would reasonably expect a medical diagnosis algorithm to never fail, it would be awkward if a computer could give the same patient a different diagnosis depending on the run. (Although doctors themselves are not immune to this kind of variability.)
These reasons are legitimate motivations to fix non-determinism in neural networks.
For all other aspects, I would say that we need to accept, if not embrace, the non-deterministic nature of neural network training. For all intents and purposes, training is stochastic: we use stochastic gradient descent, shuffle the data, use random initialization and dropout, and, more importantly, the training data itself is only a random sample. From that standpoint, the fact that computers can only generate seeded pseudo-random numbers is an artifact. When you train, your loss is a value that comes with a confidence interval due to this stochastic nature. Comparing those values to optimize hyperparameters while ignoring those confidence intervals does not make much sense; therefore, in my opinion, it is vain to spend too much effort fixing non-determinism in that case, and in many others.
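As a rough sketch of what measuring that uncertainty can look like in practice: score each hyperparameter setting over several seeded runs and compare means while keeping the spread in mind. Here train_and_evaluate is a hypothetical placeholder for your own training routine.

```python
import numpy as np

def evaluate_config(train_and_evaluate, config, seeds=(0, 1, 2, 3, 4)):
    """Score one hyperparameter setting over several seeded runs.

    `train_and_evaluate(config, seed)` is a placeholder for your own
    training routine; it should return a single validation score.
    """
    scores = [train_and_evaluate(config, seed=s) for s in seeds]
    return float(np.mean(scores)), float(np.std(scores))

# Usage: compare configurations by mean score, with the spread as context.
# mean_a, std_a = evaluate_config(my_training_fn, config_a)
# mean_b, std_b = evaluate_config(my_training_fn, config_b)
```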