Problem description
Problem: when I run the following command:
python -c "import tensorflow as tf; tf.test.is_gpu_available(); print('version :' + tf.__version__)"
Error:
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
Details:
WARNING:tensorflow:From :1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.config.list_physical_devices('GPU') instead.
2021-04-18 21:02:51.839069: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2021-04-18 21:02:51.846775: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2500000000 Hz
2021-04-18 21:02:51.847076: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fc3bc000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-04-18 21:02:51.847104: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-04-18 21:02:51.849876: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-04-18 21:02:51.911161: W tensorflow/compiler/xla/service/platform_util.cc:210] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2021-04-18 21:02:51.911285: I tensorflow/compiler/jit/xla_gpu_device.cc:161] Ignoring visible XLA_GPU_JIT device. Device number is 0, reason: Internal: no supported devices found for platform CUDA
2021-04-18 21:02:51.911546: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-18 21:02:51.912210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:07.0 name: GRID T4-4Q computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 3.97GiB deviceMemoryBandwidth: 298.08GiB/s
2021-04-18 21:02:51.912446: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-04-18 21:02:51.914362: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-04-18 21:02:51.916358: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-04-18 21:02:51.916679: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-04-18 21:02:51.918787: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-04-18 21:02:51.919993: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-04-18 21:02:51.924652: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-18 21:02:51.924792: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-18 21:02:51.925488: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-18 21:02:51.926100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2021-04-18 21:02:51.926146: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "", line 1, in
  File "/home/miniconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/test_util.py", line 1496, in is_gpu_available
    for local_device in device_lib.list_local_devices():
  File "/home/miniconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
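The deprecation warning in the log points to tf.config.list_physical_devices('GPU') as the replacement for tf.test.is_gpu_available(). Below is a minimal sketch of the same check written that way. The function name check_gpu and the status strings are made up for illustration. Listing physical devices may succeed even when the CUDA runtime cannot be initialized, so the sketch also runs one tiny op on /GPU:0 to force initialization; the exact exception type raised on failure can differ between TensorFlow versions, which is why both RuntimeError and tf.errors.OpError are caught.

```python
def check_gpu():
    """Return a short status string describing whether TensorFlow can use a GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow not installed"

    # Non-deprecated replacement for tf.test.is_gpu_available().
    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        return "no GPU visible"

    try:
        # Force the CUDA runtime to initialize by running a small op on the GPU.
        with tf.device("/GPU:0"):
            tf.reduce_sum(tf.ones((2, 2))).numpy()
    except (RuntimeError, tf.errors.OpError) as exc:
        return "GPU visible but unusable: {}".format(exc)

    return "GPU usable: {}".format([gpu.name for gpu in gpus])


if __name__ == "__main__":
    print(check_gpu())
```

On a machine in the state described here, I would expect the "GPU visible but unusable" branch, since the log shows device 0 being found before initialization fails.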
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: cloud server
TensorFlow installed from (source or binary): source
TensorFlow version: 2.2.0
Python version: 3.7.7
Installed using virtualenv? pip? conda?: pip & conda
Bazel version (if compiling from source): 2.0.0
GCC/Compiler version (if compiling from source): 7.5
CUDA/cuDNN version: CUDA 10.1 & cuDNN 7.6.5
GPU model and memory:
00:07.0 VGA compatible controller: NVIDIA Corporation Device 1eb8 (rev a1) (prog-if 00 [VGA controller])
    Subsystem: NVIDIA Corporation Device 130e
    Physical Slot: 7
    Flags: bus master, fast devsel, latency 0, IRQ 37
    Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
    Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
    I/O ports at c500 [size=128]
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
I looked for solutions to this problem, but none of the following solved it:
https://github.com/tensorflow/tensorflow/issues/41990
Tensorflow-GPU Error: "RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable"
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#recommended-post
https://github.com/tensorflow/tensorflow/issues/48558
https://programmersought.com/article/94034772029/
Recommended answer
I can confirm the case mentioned in a comment.
I had the problem while working on an Ubuntu VM running on a VMware ESXi host and using a vGPU partition of an NVIDIA V100 GPU.
I got the same error. I had already tried changing CUDA versions and installing (via pip) packages compiled for those specific CUDA versions, but this did NOT solve the issue. The error:
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
In my case I had forgotten to set the license server in /etc/nvidia/grid.conf, and I got exactly the same error, so for me it was a GRID license issue. Fixing the GRID config file and rebooting solved the problem.
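To check quickly whether a license server is configured at all, something like the following sketch could help. It assumes the config file uses plain key=value lines with # comments and that the license server is stored under a key named ServerAddress; I did not quote the file contents above, so treat both as assumptions and adjust the key name to whatever your file uses. The helper names and the sample hostname are made up for illustration.

```python
def read_grid_settings(text):
    """Parse key=value lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def license_server(text, key="ServerAddress"):
    """Return the configured license server, or None if it is missing or empty.

    The key name 'ServerAddress' is an assumption about the file format.
    """
    return read_grid_settings(text).get(key) or None


# Hypothetical file contents, for illustration only.
SAMPLE_MISSING = "# ServerAddress=\nFeatureType=1\n"
SAMPLE_SET = "ServerAddress=license.example.com\nServerPort=7070\n"

print(license_server(SAMPLE_MISSING))  # None
print(license_server(SAMPLE_SET))      # license.example.com
```

To run it against the real file, read the file's contents and pass the text to license_server(). A result of None would match the situation I was in.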