Problem description
I am working on a stateful LSTM model using Keras (TensorFlow backend), and I cannot parallelize it on a multi-GPU platform. Here is a link to the code. I am getting the following error:
[[Node: training/cna/gradients/loss/concatenate_1_loss/mul_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _class=["loc:@loss/concatenate_1_loss/mul"], _device="/job:localhost/replica:0/task:0/gpu:0"](training/cna/gradients/loss/concatenate_1_loss/mul_grad/Shape, training/cna/gradients/loss/concatenate_1_loss/mul_grad/Shape_1)]]
[[Node: replica_1/sequential_1/dense_1/truediv/_473 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:1", send_device_incarnation=1, tensor_name="edge_3032_replica_1/sequential_1/dense_1/truediv", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
I am using 2 GPUs with a batch size of 256. Please help.
Thanks.
Recommended answer
This error seems to happen simply because you're dividing an original batch of size 512 into two smaller batches of size 256.
Stateful layers require a fixed batch size (see the parameter batch_shape or batch_input_shape at the beginning of the model).
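To make the mismatch concrete, here is a minimal plain-Python sketch rather than Keras code. It assumes the multi-GPU wrapper splits each incoming batch evenly across replicas, and the helper names (per_replica_batch_sizes, check_fixed_batch) are hypothetical, introduced only for illustration.

```python
def per_replica_batch_sizes(global_batch, n_replicas):
    """Split a global batch as evenly as possible across replicas (assumed behavior)."""
    base, extra = divmod(global_batch, n_replicas)
    return [base + (1 if i < extra else 0) for i in range(n_replicas)]


def check_fixed_batch(expected, actual):
    """Mimic a stateful layer that only accepts the batch size it was built with."""
    if actual != expected:
        raise ValueError(
            "stateful layer built for batch size %d received %d" % (expected, actual)
        )


# Hypothetical value: the batch size the stateful model was built with.
FIXED_BATCH = 512

sizes = per_replica_batch_sizes(512, 2)  # each replica sees half of the batch

errors = []
for size in sizes:
    try:
        check_fixed_batch(FIXED_BATCH, size)
    except ValueError as exc:
        errors.append(str(exc))
```

Under these assumptions every replica receives 256 samples while the layer expects 512, so each replica fails the check. Building the model for the per-replica size instead makes the check pass.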
You may try recreating the model with batch_shape (or batch_input_shape) changed to 256 (if it's currently 512), or the other way around if I'm mistaken about the current value.
If you already have a trained model with weights you want to keep, you can create another model with the same types of layers and the same shapes, changing only the input shape. Then you can call newModel.set_weights(oldModel.get_weights()).
That said, I don't think it's safe to parallelize a stateful model. In stateful models, "batch2" is the continuation of "batch1": both batches represent the same sequences, and their order matters. If batch2 is processed before batch1, you will be feeding the sequence out of order and the model will learn it incorrectly.
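The ordering concern can be illustrated with a toy recurrence in plain Python. This is only a sketch: the update rule (state = decay * state + x) is a stand-in for an LSTM's carried state, not what Keras actually computes, and run_chunk is a hypothetical helper.

```python
def run_chunk(state, chunk, decay=0.5):
    """Feed one chunk through a toy recurrence, carrying the state forward."""
    for x in chunk:
        state = decay * state + x
    return state


sequence = [1.0, 2.0, 3.0, 4.0]
batch1, batch2 = sequence[:2], sequence[2:]  # batch2 continues batch1

whole = run_chunk(0.0, sequence)                       # the full sequence at once
in_order = run_chunk(run_chunk(0.0, batch1), batch2)   # batch1, then batch2
swapped = run_chunk(run_chunk(0.0, batch2), batch1)    # batch2, then batch1

print(whole, in_order, swapped)
```

Processing the chunks in order reproduces the state of the full sequence, while swapping them gives a different final state. That is the risk if replicas do not preserve batch order.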
Unless the Keras documentation explicitly states that a stateful model can be safely parallelized, you would do well to check carefully (over many runs) whether the parallelized model always gives the same results as the single-GPU model.
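One way to organize such a check is sketched below in plain Python. Everything in it is an assumption for illustration: run_single and run_parallel are hypothetical callables that would wrap your own single-GPU and multi-GPU prediction code and return flat lists of floats, and the tolerances are arbitrary.

```python
import math


def predictions_match(reference, candidate, rel_tol=1e-5, abs_tol=1e-6):
    """Return True if two flat lists of predictions agree within tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def always_matches(run_single, run_parallel, attempts=20):
    """Repeat the comparison several times; one mismatch fails the check."""
    return all(
        predictions_match(run_single(), run_parallel()) for _ in range(attempts)
    )


# Hypothetical stand-ins; replace them with real model predictions.
def fake_single():
    return [0.10, 0.20, 0.30]


def fake_parallel_ok():
    return [0.10, 0.20, 0.30]


def fake_parallel_bad():
    return [0.30, 0.20, 0.10]


ok = always_matches(fake_single, fake_parallel_ok)
bad = always_matches(fake_single, fake_parallel_bad)
```

A tolerance is used instead of exact equality because floating-point results can differ slightly between devices even when the computation is logically the same.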