Keras + tensorflow + P100: cudaErrorNotSupported = 71



Apologies if this has already been reported elsewhere; I have been searching for quite some time without success.

While running the simple MNIST example (available on github /fchollet/keras/blob/master/examples/mnist_cnn.py) with keras+tensorflow on a P100 GPGPU, we encounter an issue at the intersection of keras/tensorflow/cuda:

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-PCIE-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:02:00.0
Total memory: 15.89GiB
Free memory: 15.51GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:02:00.0)
F tensorflow/core/common_runtime/gpu/gpu_device.cc:121] Check failed: err == cudaSuccess (71 vs. 0)
srun: error: nid02011: task 0: Aborted
srun: Terminating job step 1262138.0

We are using keras 2.0.2, tensorflow 1.0.0 and cuda 8.0.53. We seem to be having this issue with both python2.7.12 and python3.5.2 (keras 1.2 and 2.0 ...).


Bare tensorflow tests run fine, which leads us to think that this really is at the intersection of keras/tensorflow/cuda.

The same test runs fine on various machines with the same versions of the software but with TitanX GPGPUs.

It seems to be traceable to line 121 of tensorflow's gpu_device.cc, where the check fails with CUDA error 71:


cudaErrorNotSupported = 71
This error indicates the attempted operation is not supported on the current system or device.
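
For anyone triaging similar logs, here is a minimal sketch of pulling the failing CUDA error code out of a tensorflow log line like the one above. The function name `find_cuda_error` and the lookup table are hypothetical helpers written for illustration. The only mapping taken from this thread is 71 = cudaErrorNotSupported as reported for CUDA 8.0, and the numbering may differ in other CUDA releases.

```python
import re

# Assumption: only the codes mentioned in this thread (CUDA 8.0) are listed.
# Other CUDA releases may number their errors differently.
KNOWN_CUDA_ERRORS = {0: "cudaSuccess", 71: "cudaErrorNotSupported"}

# Matches the fatal check seen in the log, e.g.
# "Check failed: err == cudaSuccess (71 vs. 0)"
CHECK_FAILED = re.compile(r"Check failed: err == cudaSuccess \((\d+) vs\. 0\)")


def find_cuda_error(log_text):
    """Return (code, name) for the first failed CUDA check, or None if absent."""
    match = CHECK_FAILED.search(log_text)
    if match is None:
        return None
    code = int(match.group(1))
    return code, KNOWN_CUDA_ERRORS.get(code, "unknown")


if __name__ == "__main__":
    line = ("F tensorflow/core/common_runtime/gpu/gpu_device.cc:121] "
            "Check failed: err == cudaSuccess (71 vs. 0)")
    print(find_cuda_error(line))
```

Feeding it the whole job output works as well, since it returns the first failed check it finds.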


I am clueless as to where to look next to solve this issue. I would greatly appreciate any feedback and guidance on this matter.


The underlying source of the problem appears to be an incompatibility between Tensorflow and the CUDA MPS service (see the related Tensorflow tracker issue). It should only affect clusters and large systems which use the MPS service to improve the granularity of access to GPU devices.

This should probably be raised as a bug with the Tensorflow development team.
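
To check whether a node is actually running MPS before blaming it, something like the following rough sketch may help. It is an assumption-laden heuristic, not an official detection method: it looks for the MPS daemon binaries (`nvidia-cuda-mps-control`, `nvidia-cuda-mps-server`) in a Linux `/proc` tree and for the `CUDA_MPS_PIPE_DIRECTORY` / `CUDA_MPS_LOG_DIRECTORY` environment variables. The function names are hypothetical, and a job may not be able to see daemons owned by other users on a locked-down cluster.

```python
import os

# Names of the MPS control daemon and server binaries shipped with CUDA.
MPS_PROCESS_NAMES = ("nvidia-cuda-mps-control", "nvidia-cuda-mps-server")
MPS_ENV_VARS = ("CUDA_MPS_PIPE_DIRECTORY", "CUDA_MPS_LOG_DIRECTORY")


def mps_env_hints(environ=None):
    """Return the MPS-related environment variables that are set."""
    environ = os.environ if environ is None else environ
    return {name: environ[name] for name in MPS_ENV_VARS if name in environ}


def mps_processes(proc_root="/proc"):
    """Return sorted PIDs whose executable name looks like an MPS daemon.

    Linux only. Reads cmdline rather than comm, because comm is truncated
    to 15 characters and would cut the daemon names short.
    """
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                argv0 = handle.read().split(b"\0", 1)[0].decode("utf-8", "replace")
        except OSError:
            continue  # process exited or is not readable by this user
        if os.path.basename(argv0) in MPS_PROCESS_NAMES:
            found.append(int(entry))
    return sorted(found)


if __name__ == "__main__":
    print("MPS environment hints:", mps_env_hints())
    print("MPS daemon PIDs:", mps_processes())
```

An empty result does not prove MPS is off; asking the cluster administrators remains the reliable route.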

Edited to add a diagnosis from the Tensorflow tracker issue:


It appears the underlying reason is the extensive use of stream callbacks in Tensorflow, which MPS did not support before the recent Volta hardware release from NVIDIA. Apparently it is also possible to build Tensorflow from source with options that make it work correctly with MPS on earlier hardware as well. See the linked tracker discussion for more details.


[This answer was assembled from comments and added as a community wiki entry in order to get it off the unanswered list for the CUDA tag]


08-22 14:10