This article describes how to handle the tensorflow-GPU OOM problem that appears after a few epochs. It should be a useful reference for anyone running into the same issue.

Problem Description

I used TensorFlow to train a CNN on an Nvidia GeForce 1060 (6 GB of memory), but I got an OOM exception.

The training process was fine for the first two epochs, but an OOM exception was raised on the third epoch.

==============================
2017-10-27 11:47:30.219130: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ******************************************************************************************************xxxxxx
2017-10-27 11:47:30.265389: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[10,10,48,48,48]
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,10,48,48,48]
  [[Node: gradients_4/global/detector_scope/maxpool_conv3d_2/MaxPool3D_grad/MaxPool3DGrad = MaxPool3DGrad[T=DT_FLOAT, TInput=DT_FLOAT, data_format="NDHWC", ksize=[1, 2, 2, 2, 1], padding="VALID", strides=[1, 2, 2, 2, 1], _device="/job:localhost/replica:0/task:0/gpu:0"](global/detector_scope/maxpool_conv3d_2/transpose, global/detector_scope/maxpool_conv3d_2/MaxPool3D, gradients_4/global/detector_scope/maxpool_conv3d_2/transpose_1_grad/transpose)]]
  [[Node: Momentum_4/update/_540 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1540_Momentum_4/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
==============================

So, I am confused about why I got this OOM exception on the third epoch, after the first two epochs finished without problems.

Given that the dataset is the same in every epoch, if I were running out of GPU memory I should have gotten the exception during the first epoch. But I did successfully finish two epochs. So why did this happen later?

Any suggestions?

Recommended Answer

There are two times when you are likely to see OOM errors: when you first start training, and after at least one epoch has completed.

The first situation is simply due to the model's memory footprint. The easiest fix is to reduce the batch size. If your model is really big and your batch size is already down to one, you still have a few options: reduce the size of the hidden layers, or move to a cloud instance with enough GPU memory (or even CPU-only execution) so that the static allocation of memory works.
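As a rough sketch of the batch-size fix (not the asker's actual model): the placeholder below is hypothetical and only mirrors the 3D volumes from the error log; the leading batch dimension is the knob to turn down until the forward and backward passes fit on the 6 GB card.

import numpy as np
import tensorflow as tf

BATCH_SIZE = 2  # was e.g. 10; halve it until training no longer OOMs

# Hypothetical input placeholder; only the leading batch dimension matters here
volumes = tf.placeholder(tf.float32, shape=[BATCH_SIZE, 48, 48, 48, 1])
pooled = tf.nn.max_pool3d(volumes, ksize=[1, 2, 2, 2, 1],
                          strides=[1, 2, 2, 2, 1], padding="VALID")

with tf.Session() as sess:
    batch = np.zeros([BATCH_SIZE, 48, 48, 48, 1], dtype=np.float32)
    print(sess.run(pooled, feed_dict={volumes: batch}).shape)  # (2, 24, 24, 24, 1)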

For the second situation, you are likely running into a memory leak of sorts. Many training implementations use a callback on a hold-out dataset to compute a validation score. This execution, for example when invoked by Keras, may hold on to GPU session resources. If those are not released they build up and can cause the GPU to report OOM after several epochs. Others have suggested using a second GPU instance for the validation session, but I think a better approach is smarter session handling in the validation callback, specifically releasing the GPU session resources when each validation callback completes (see the snippets and the sketch below).

Here is pseudo code illustrating the callback problem. This callback leads to OOM:

# placeholder pseudo code: the score is computed in the long-lived training session, so its resources accumulate
my_models_validation_score = tf.get_some_v_score

This callback does not lead to OOM, because the session (and the GPU resources it holds) is released when the with block exits:

with tf.Session() as sess:      # dedicated session for the validation pass
    sess.run(get_some_v_score)  # its resources are freed when the session closes
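To make the second snippet concrete, here is a minimal sketch of a self-contained validation helper, assuming TF1-style graph code and a checkpoint written by the training loop; run_validation, score_op and checkpoint_path are hypothetical names, not part of the asker's code.

import tensorflow as tf

def run_validation(score_op, checkpoint_path):
    # Evaluate the validation score in its own session so that every GPU
    # resource it acquires is released as soon as the with block exits.
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, checkpoint_path)  # reload the weights saved during training
        return sess.run(score_op)

Calling a helper like this at the end of each epoch keeps the validation pass out of the long-lived training session, which is the behaviour the non-leaking snippet above relies on.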

I invite others to help add to this response...

That concludes this article on the tensorflow-GPU OOM problem after a few epochs. We hope the recommended answer is helpful.
