Problem description
I am doing some research on training deep neural networks with TensorFlow. I know how to train a model. My problem is that I have to train the same model on 2 different computers with different datasets and then save the model weights. Later I have to merge the 2 model weight files somehow, and I have no idea how to merge them. Is there a function that does this, or should the weights be averaged?
Any help on this problem would be useful.
Thanks in advance.
Recommended answer
It is better to merge the weight updates (gradients) during training and keep one common set of weights than to try to merge the weights after the individual training runs have completed. Two individually trained networks may each find a different optimum, and averaging the weights, for example, may give a network that performs worse on both datasets.
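To see why averaging can go wrong, here is a minimal sketch in plain Python (not TensorFlow). It assumes a hypothetical toy network with one input, two ReLU hidden units and one output; the names `predict` and `average_weights` are made up for this illustration. Both networks compute exactly the same function, only with the hidden units in swapped order, yet their averaged weights compute something else entirely.

```python
def relu(v):
    return max(0.0, v)


def predict(weights, x):
    """Toy network: one input, ReLU hidden units, one linear output."""
    hidden_w, out_w = weights
    return sum(o * relu(h * x) for h, o in zip(hidden_w, out_w))


def average_weights(a, b):
    """Element-wise average of two weight sets with identical shapes."""
    return tuple(
        [(p + q) / 2 for p, q in zip(layer_a, layer_b)]
        for layer_a, layer_b in zip(a, b)
    )


# Both networks compute |x|; net_b is net_a with its hidden units swapped.
net_a = ([1.0, -1.0], [1.0, 1.0])
net_b = ([-1.0, 1.0], [1.0, 1.0])
merged = average_weights(net_a, net_b)

for x in (-2.0, 0.5, 3.0):
    print(x, predict(net_a, x), predict(net_b, x), predict(merged, x))
```

The averaged hidden weights cancel to zero, so the merged network outputs 0 for every input even though both originals were perfect. Real networks are far larger, but the same ordering ambiguity is one reason separately trained weights do not line up.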
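There are two things you can do:
- Look at "data parallel training": distributing the forward and backward passes of the training process over multiple compute nodes, each of which holds a subset of the entire data.
  In this case, typically:
  - each node propagates a mini-batch forward through the network
  - each node propagates the loss gradient backward through the network
  - a "master node" collects the gradients from the mini-batches on all nodes and updates the weights accordingly
  - and distributes the weight updates back to the compute nodes to make sure each node has the same set of weights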
(There are variants of the above that keep compute nodes from idling too long while waiting for results from others.) The above assumes that the TensorFlow processes running on the compute nodes can communicate with each other during training.
See https://www.tensorflow.org/deploy/distributed for more details and an example of how to train networks over multiple nodes.
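The data-parallel loop described above can be sketched roughly as follows. This is a single-process simulation in plain Python, not TensorFlow's distributed API. It assumes a one-parameter model y = w * x with a squared-error loss, treats each node's whole shard as its mini-batch, and has the master average the gradients (the averaging is an assumption; other ways of combining them are possible). The names `loss_gradient` and `train_data_parallel` are hypothetical.

```python
def loss_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_data_parallel(shards, learning_rate=0.05, steps=50):
    shared_w = 0.0                         # the common set of weights
    worker_w = [shared_w] * len(shards)    # each node's local copy
    for _ in range(steps):
        # Each node runs the forward and backward pass on its own data.
        grads = [loss_gradient(w, shard) for w, shard in zip(worker_w, shards)]
        # The master collects the gradients and updates the common weights.
        shared_w -= learning_rate * sum(grads) / len(grads)
        # The update is sent back so every node holds the same weights.
        worker_w = [shared_w] * len(shards)
    return shared_w, worker_w


# Two "computers", each holding a different subset of data drawn from y = 3x.
shard_a = [(1.0, 3.0), (2.0, 6.0)]
shard_b = [(3.0, 9.0), (4.0, 12.0)]
shared_w, worker_w = train_data_parallel([shard_a, shard_b])
print(shared_w, worker_w)
```

Because the weights are synchronized after every step, there is never a second set of weights to merge at the end.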
- If you really have to train the networks separately, look at ensembling; see e.g. this page: https://mlwave.com/kaggle-ensembling-guide/ . In a nutshell, you would train the individual networks on their own machines and then use, for example, an average or maximum over the outputs of both networks as a combined classifier / predictor.
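As a rough illustration of the ensembling option, the sketch below combines the outputs of two models rather than their weights. It is plain Python; `probs_a` and `probs_b` are made-up class probabilities standing in for whatever the two separately trained networks would output for the same input, and the helper names are hypothetical.

```python
def average_ensemble(outputs):
    """Average the per-class outputs of several models."""
    return [sum(column) / len(outputs) for column in zip(*outputs)]


def max_ensemble(outputs):
    """Take the per-class maximum over the outputs of several models."""
    return [max(column) for column in zip(*outputs)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Placeholder outputs of the two networks for one input, three classes.
probs_a = [0.5, 0.25, 0.25]
probs_b = [0.125, 0.75, 0.125]

averaged = average_ensemble([probs_a, probs_b])
maximum = max_ensemble([probs_a, probs_b])
print(averaged, argmax(averaged))
print(maximum, argmax(maximum))
```

Both networks stay untouched, so each keeps whatever it learned from its own dataset; the cost is that every prediction requires running both models.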