This article covers how to handle "TensorFlow Object Detection training getting killed, resource starvation?". It should be a useful reference for anyone running into the same problem.

Problem Description

This question has partially been asked here and here with no follow-ups, so maybe this is not the venue to ask it, but I've figured out a little more information that I'm hoping might get these questions answered.

I've been attempting to train object_detection on my own library of roughly 1k photos, using the provided pipeline config file "ssd_inception_v2_pets.config". I believe I've set up the training data properly. The program appears to start training just fine; when it couldn't read the data, it alerted with an error, and I fixed that.

My train_config settings are as follows, though I've changed a few of the numbers in order to try to get it to run with fewer resources.

train_config: {
  batch_size: 1000  # also tried 1, 10, and 100
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.04  # also tried .004
          decay_steps: 800 # also tried 800720, 80072
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "~/Downloads/ssd_inception_v2_coco_11_06_2017/model.ckpt" #using inception checkpoint
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

Basically, what I think is happening is that the computer is getting resource-starved very quickly, and I'm wondering whether anyone has an optimization that takes more time to build but uses fewer resources?

Or am I wrong about why the process is getting killed, and is there a way for me to get more information about that from the kernel?

This is the dmesg output I get after the process is killed.

[711708.975215] Out of memory: Kill process 22087 (python) score 517 or sacrifice child
[711708.975221] Killed process 22087 (python) total-vm:9086536kB, anon-rss:6114136kB, file-rss:24kB, shmem-rss:0kB

Recommended Answer

I ran into the same problem as you. The memory exhaustion is actually caused by the data_augmentation_options ssd_random_crop, so you can remove that option and set the batch size to 8 or smaller, i.e. 2 or 4. When I set the batch size to 1, I also hit some problems caused by NaN loss.
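For concreteness, here is a minimal sketch of what the train_config above might look like after applying that advice: ssd_random_crop removed and batch_size lowered to 8. The batch_queue_capacity and prefetch_queue_capacity lines are an optional extra memory-saving tweak; I'm assuming your version of the Object Detection API's train.proto exposes those input-queue fields.

train_config: {
  batch_size: 8                # try 4 or 2 if 8 still gets killed; batch_size 1 can lead to NaN loss
  batch_queue_capacity: 2      # optional: shrink the input queues to cut memory further
  prefetch_queue_capacity: 2   # (assumes these fields exist in your train.proto version)
  # ... optimizer, fine_tune_checkpoint, from_detection_checkpoint unchanged ...
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  # the ssd_random_crop block has been removed entirely
}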

Another thing is that the parameter epsilon should be a very small number, such as 1e, according to the "Deep Learning" book. Epsilon is only there to avoid a zero denominator, but the default value here is 1, and I don't think it's correct to set it to 1.
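As a minimal sketch, the rms_prop_optimizer block above could be adjusted along these lines. The value 1e-8 is purely illustrative (the answer does not spell out the exact exponent); the point is simply that epsilon should stay tiny, since it only guards against a zero denominator.

  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1e-8  # illustrative small value; leaving the default of 1.0 defeats its purpose
    }
  }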

That concludes this article on "TensorFlow Object Detection training getting killed, resource starvation?". Hopefully the recommended answer above helps.
