Running AWS SageMaker with a custom model, the TrainingJob fails with an Algorithm Error when using Keras with a TensorFlow backend in a multi-GPU configuration:

    from keras.utils import multi_gpu_model

    parallel_model = multi_gpu_model(model, gpus=K)
    parallel_model.compile(loss='categorical_crossentropy',
                           optimizer='rmsprop')
    parallel_model.fit(x, y, epochs=20, batch_size=256)

This simple parallel model loading fails. There is no further error or exception in the CloudWatch logs. The same configuration works properly on a local machine with 2x NVIDIA GTX 1080 and the same Keras/TensorFlow backend.

According to the SageMaker documentation and tutorials, the multi_gpu_model utility works when the Keras backend is MXNet, but I did not find any mention of the case where the backend is TensorFlow with the same multi-GPU configuration.
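A quick sanity check that can help narrow this down is to confirm, from inside the training container, that the TensorFlow backend actually sees all the GPUs before the model is wrapped. A minimal sketch (not from the original post) using TensorFlow's device listing:

    # Diagnostic sketch: list the GPU devices visible to TensorFlow inside the
    # training container. The device count should match the instance type
    # (e.g. four devices on a 4-GPU instance).
    from tensorflow.python.client import device_lib

    def visible_gpus():
        devices = device_lib.list_local_devices()
        return [d.name for d in devices if d.device_type == 'GPU']

    print('Visible GPUs: %s' % visible_gpus())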
[UPDATE]

I have updated the code with the suggested answer below, and I'm adding some logging captured before the TrainingJob hangs.

This logging repeats twice:

    2018-11-27 10:02:49.878414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
    2018-11-27 10:02:49.878462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
    2018-11-27 10:02:49.878471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2 3
    2018-11-27 10:02:49.878477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y Y Y
    2018-11-27 10:02:49.878481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N Y Y
    2018-11-27 10:02:49.878486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   Y Y N Y
    2018-11-27 10:02:49.878492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3:   Y Y Y N
    2018-11-27 10:02:49.879340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14874 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
    2018-11-27 10:02:49.879486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 14874 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
    2018-11-27 10:02:49.879694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 14874 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
    2018-11-27 10:02:49.879872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:3 with 14874 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)

Before that, there is some logging info about each GPU, repeated 4 times:

    2018-11-27 10:02:46.447639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
    name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
    pciBusID: 0000:00:1e.0
    totalMemory: 15.78GiB freeMemory: 15.37GiB

According to the logging, all 4 GPUs are visible and loaded in the TensorFlow Keras backend. After that no application logging follows; the TrainingJob status stays InProgress for a while and then becomes Failed with the same Algorithm Error.

Looking at the CloudWatch metrics, I can see some of them at work. Specifically, GPU Memory Utilization and CPU Utilization look ok, while GPU Utilization is 0%.

[UPDATE]

Due to a known bug in Keras about saving a multi-GPU model, I'm using this override of the multi_gpu_model utility from keras.utils:

    from keras.layers import Lambda, concatenate
    from keras import Model

    import tensorflow as tf

    def multi_gpu_model(model, gpus):
        # source: https://github.com/keras-team/keras/issues/8123#issuecomment-354857044
        if isinstance(gpus, (list, tuple)):
            num_gpus = len(gpus)
            target_gpu_ids = gpus
        else:
            num_gpus = gpus
            target_gpu_ids = range(num_gpus)

        def get_slice(data, i, parts):
            shape = tf.shape(data)
            batch_size = shape[:1]
            input_shape = shape[1:]
            step = batch_size // parts
            if i == num_gpus - 1:
                size = batch_size - step * i
            else:
                size = step
            size = tf.concat([size, input_shape], axis=0)
            stride = tf.concat([step, input_shape * 0], axis=0)
            start = stride * i
            return tf.slice(data, start, size)

        all_outputs = []
        for i in range(len(model.outputs)):
            all_outputs.append([])

        # Place a copy of the model on each GPU,
        # each getting a slice of the inputs.
        for i, gpu_id in enumerate(target_gpu_ids):
            with tf.device('/gpu:%d' % gpu_id):
                with tf.name_scope('replica_%d' % gpu_id):
                    inputs = []
                    # Retrieve a slice of the input.
                    for x in model.inputs:
                        input_shape = tuple(x.get_shape().as_list())[1:]
                        slice_i = Lambda(get_slice,
                                         output_shape=input_shape,
                                         arguments={'i': i, 'parts': num_gpus})(x)
                        inputs.append(slice_i)

                    # Apply model on slice
                    # (creating a model replica on the target device).
                    outputs = model(inputs)
                    if not isinstance(outputs, list):
                        outputs = [outputs]

                    # Save the outputs for merging back together later.
                    for o in range(len(outputs)):
                        all_outputs[o].append(outputs[o])

        # Merge outputs on CPU.
        with tf.device('/cpu:0'):
            merged = []
            for name, outputs in zip(model.output_names, all_outputs):
                merged.append(concatenate(outputs, axis=0, name=name))
            return Model(model.inputs, merged)

This works ok on a local 2x NVIDIA GTX 1080 / Intel Xeon / Ubuntu 16.04 machine. It fails on the SageMaker Training Job.
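For context, the save problem that motivates this override is often handled without replacing multi_gpu_model at all: keep a reference to the original single-device (template) model, which shares weights with the parallel wrapper, and checkpoint that one. A minimal sketch of that workaround, assuming the bug referred to is the multi-GPU save issue discussed in keras-team/keras#8123; the callback name and file path here are illustrative only:

    from keras.callbacks import Callback

    class TemplateModelCheckpoint(Callback):
        """Save the original template model instead of the multi-GPU wrapper."""

        def __init__(self, template_model, path):
            super(TemplateModelCheckpoint, self).__init__()
            self.template_model = template_model
            self.path = path

        def on_epoch_end(self, epoch, logs=None):
            # The template model shares its weights with the parallel model,
            # so saving it here captures the weights learned on all GPUs.
            self.template_model.save(self.path.format(epoch=epoch))

    # Usage sketch:
    # parallel_model = multi_gpu_model(model, gpus=4)
    # parallel_model.fit(x, y, callbacks=[TemplateModelCheckpoint(model, 'model_{epoch:02d}.h5')])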
I have posted this issue on the AWS SageMaker forum in "TrainingJob custom algorithm with Keras backend and multi GPU" and "SageMaker Fails when using Multi-GPU with keras.utils.multi_gpu_model".

[UPDATE]

I have slightly modified the tf.Session code, adding some initializers:

    with tf.Session() as session:
        K.set_session(session)
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())

and now at least I can see from the instance metrics that one GPU (I assume device gpu:0) is used. Multi-GPU still does not work.

Solution

This might not be the best answer for your problem, but this is what I am using for a multi-GPU model with the TensorFlow backend. First I initialize using:

    def setup_multi_gpus():
        """
        Setup multi GPU usage

        Example usage:
        model = Sequential()
        ...
        multi_model = multi_gpu_model(model, gpus=num_gpu)
        multi_model.fit()

        About memory usage:
        https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory
        """
        import tensorflow as tf
        from keras.utils.training_utils import multi_gpu_model
        from tensorflow.python.client import device_lib

        # IMPORTANT: Tells tf to not occupy a specific amount of memory
        from keras.backend.tensorflow_backend import set_session
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
        sess = tf.Session(config=config)
        set_session(sess)  # set this TensorFlow session as the default session for Keras

        # getting the number of GPUs
        def get_available_gpus():
            local_device_protos = device_lib.list_local_devices()
            return [x.name for x in local_device_protos if x.device_type == 'GPU']

        num_gpu = len(get_available_gpus())
        print('Amount of GPUs available: %s' % num_gpu)

        return num_gpu

Then I call:

    # Setup multi GPU usage
    num_gpu = setup_multi_gpus()

and create a model.

    ...

After which you're able to make it a multi-GPU model:

    multi_model = multi_gpu_model(model, gpus=num_gpu)
    multi_model.compile...
    multi_model.fit...

The only thing here that differs from what you are doing is the way TensorFlow initializes the GPUs. I can't imagine it being the problem, but it might be worth trying out.

Good luck!

Edit: I noticed sequence-to-sequence models not being able to work with multi GPU. Is that the type of model you are trying to train?
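Putting the pieces of this answer together, here is a minimal end-to-end sketch; the small Sequential network and the random data are placeholders, not the asker's actual model:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.utils.training_utils import multi_gpu_model

    num_gpu = setup_multi_gpus()  # defined above; also configures the TF session

    # Placeholder network standing in for the real model.
    model = Sequential()
    model.add(Dense(64, activation='relu', input_shape=(100,)))
    model.add(Dense(10, activation='softmax'))

    # Wrap the model only when more than one GPU is available.
    if num_gpu > 1:
        train_model = multi_gpu_model(model, gpus=num_gpu)
    else:
        train_model = model

    train_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # Random placeholder data, just to make the sketch runnable.
    x = np.random.random((1024, 100))
    y = np.random.random((1024, 10))
    train_model.fit(x, y, epochs=2, batch_size=256)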