Problem Description
I'm trying to train on multiple GPUs using tf.distribute.MirroredStrategy().
After several attempts to apply it to my custom code, I got an error about NcclAllReduce.
So I copied the MNIST tutorial that uses tf.distribute from the TensorFlow site, and running it produces the same error. The logs and my environment are below.
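For reference, the training loop follows the tutorial pattern roughly like this (a condensed sketch, not my exact script; the failure occurs on the first call to distributed_train_step):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # defaults to NCCL all-reduce across GPUs

GLOBAL_BATCH_SIZE = 64

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10)])
    optimizer = tf.keras.optimizers.Adam()
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

def train_step(inputs):
    images, labels = inputs
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        per_example_loss = loss_object(labels, logits)
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    # apply_gradients triggers the cross-replica gradient reduction
    # (the Adam/NcclAllReduce node shown in the traceback below)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(dataset_inputs):
    per_replica_losses = strategy.experimental_run_v2(
        train_step, args=(dataset_inputs,))  # strategy.run() in TF >= 2.2
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)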
My Environment
sys.platform: Windows 10
Python: 3.7.6
Numpy: 1.18.1
TensorFlow: 2.0.0
TF CUDA support: True
GPU: 2 GPUs, both Quadro GV100
INFO:tensorflow:batch_all_reduce: 8 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 8 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-12-a7ead7e91ea5> in <module>
19 num_batches = 0
20 for x in train_dist_dataset:
---> 21 total_loss += distributed_train_step(x)
22 num_batches += 1
23 train_loss = total_loss / num_batches
~\Anaconda3\envs\tf-MSTO-DL\lib\site-packages\tensorflow_core\python\eager\def_function.py in __call__(self, *args, **kwds)
455
456 tracing_count = self._get_tracing_count()
--> 457 result = self._call(*args, **kwds)
458 if tracing_count == self._get_tracing_count():
459 self._call_counter.called_without_tracing()
~\Anaconda3\envs\tf-MSTO-DL\lib\site-packages\tensorflow_core\python\eager\def_function.py in _call(self, *args, **kwds)
518 # Lifting succeeded, so variables are initialized and we can run the
519 # stateless function.
--> 520 return self._stateless_fn(*args, **kwds)
521 else:
522 canon_args, canon_kwds = \
~\Anaconda3\envs\tf-MSTO-DL\lib\site-packages\tensorflow_core\python\eager\function.py in __call__(self, *args, **kwargs)
1821 """Calls a graph function specialized to the inputs."""
1822 graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 1823 return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
1824
1825 @property
~\Anaconda3\envs\tf-MSTO-DL\lib\site-packages\tensorflow_core\python\eager\function.py in _filtered_call(self, args, kwargs)
1139 if isinstance(t, (ops.Tensor,
1140 resource_variable_ops.BaseResourceVariable))),
-> 1141 self.captured_inputs)
1142
1143 def _call_flat(self, args, captured_inputs, cancellation_manager=None):
~\Anaconda3\envs\tf-MSTO-DL\lib\site-packages\tensorflow_core\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
1222 if executing_eagerly:
1223 flat_outputs = forward_function.call(
-> 1224 ctx, args, cancellation_manager=cancellation_manager)
1225 else:
1226 gradient_name = self._delayed_rewrite_functions.register()
~\Anaconda3\envs\tf-MSTO-DL\lib\site-packages\tensorflow_core\python\eager\function.py in call(self, ctx, args, cancellation_manager)
509 inputs=args,
510 attrs=("executor_type", executor_type, "config_proto", config),
--> 511 ctx=ctx)
512 else:
513 outputs = execute.execute_with_cancellation(
~\Anaconda3\envs\tf-MSTO-DL\lib\site-packages\tensorflow_core\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
65 else:
66 message = e.message
---> 67 six.raise_from(core._status_to_exception(e.code, message), None)
68 except TypeError as e:
69 keras_symbolic_tensors = [
~\Anaconda3\envs\tf-MSTO-DL\lib\site-packages\six.py in raise_from(value, from_value)
InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node Adam/NcclAllReduce}}with these attrs: [reduction="sum", shared_name="c1", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, GPU]
Registered kernels:
<no registered kernels>
[[Adam/NcclAllReduce]] [Op:__inference_distributed_train_step_1755]
Recommended Answer
There are several options for cross_device_ops. It seems that
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())
will produce an NCCL error depending on your architecture and configuration.
This option was meant for the NVIDIA DGX-1 architecture and might underperform on other architectures:
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
This one should work:
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice())
So it's advisable to try the different options.
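For example, here is a minimal sketch (assuming TF 2.0 on Windows, where no NCCL kernel is registered, consistent with the "No OpKernel was registered" error above): passing an explicit cross_device_ops ensures MirroredStrategy never emits an NcclAllReduce op.

import tensorflow as tf

# ReductionToOneDevice reduces gradients on a single device and
# broadcasts the result back, avoiding NCCL entirely.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice())
# Or try: cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()

print('Number of devices:', strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created here are mirrored on both GPUs; gradient
    # aggregation uses the chosen cross_device_ops instead of NCCL.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer='adam', loss='mse')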