Problem description
I am training a CNN. The following error has appeared 3 times this week, each time after a long run (e.g., 419140 steps).
Here is part of the log:
2017-09-15 11:16:03.515396: step 419120, loss = 0.30 (4427.4 examples/sec; 0.029 sec/batch)
2017-09-15 11:16:03.766922: step 419130, loss = 0.38 (5089.0 examples/sec; 0.025 sec/batch)
2017-09-15 11:16:04.073978: step 419140, loss = 0.40 (4168.5 examples/sec; 0.031 sec/batch)
2017-09-15 20:48:03.734101: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
2017-09-15 20:48:03.734133: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
If I restart the training, TensorFlow does not use the GPU. Here is the relevant log:
2017-09-15 21:54:38.681074: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
To make the GPU work again, I have to restart my computer.
The error appears to come from a C++ file that I am not familiar with. Can someone give me some advice on how to debug or work around this error?
Recommended answer
I faced the same problem, and I found a suggestion about why it happens here: https://devtalk.nvidia.com/default/topic/1046479/gpu-occasionally-gets-lost-when-running-tensorflow-/
Apparently, an Nvidia GPU throws this error when it overheats!
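If overheating is indeed the cause, one way to check is to watch the GPU temperature during a long run and pause between training steps while the card is too hot. The following is only a rough sketch, not a confirmed fix. It assumes `nvidia-smi` is on the PATH and reports a numeric temperature for every GPU. The helper names (`parse_temperatures`, `read_gpu_temperatures`, `wait_until_cool`) and the 85 °C threshold are made up for illustration, so pick a limit that suits your card.

```python
import subprocess
import time


def parse_temperatures(nvidia_smi_output):
    """Parse one integer temperature (deg C) per non-empty line of output."""
    return [int(line.strip())
            for line in nvidia_smi_output.splitlines()
            if line.strip()]


def read_gpu_temperatures():
    """Ask nvidia-smi for the current temperature of every GPU."""
    output = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_temperatures(output)


def wait_until_cool(max_temp_c=85, poll_seconds=30,
                    read_temps=read_gpu_temperatures, sleep=time.sleep):
    """Block while any GPU is hotter than max_temp_c (an assumed threshold).

    Returns the number of polls on which a GPU was found too hot.
    """
    hot_polls = 0
    while max(read_temps()) > max_temp_c:
        hot_polls += 1
        sleep(poll_seconds)
    return hot_polls
```

Calling `wait_until_cool()` every few hundred steps in the training loop would pause training whenever a GPU is above the threshold. If the crashes stop, or the recorded temperatures climb right before a crash, that supports the overheating explanation. Improving case airflow or lowering the power limit would be the more durable fix.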