Problem description
I successfully trained an object detection model with custom examples using train.py and eval.py. Running both programs in parallel, I was able to visualize training and evaluation metrics in TensorBoard during training.
However, both programs have been moved to the legacy folder, and model_main.py seems to be the preferred way to run training and evaluation (by executing only a single process). But when I start model_main.py with the following pipeline.config:
train_config {
  batch_size: 1
  num_steps: 40000
  ...
}
eval_config {
  # entire evaluation set
  num_examples: 821
  # for continuous evaluation
  max_evals: 0
  ...
}
I can see in the output of model_main.py (with INFO logging enabled) that training and evaluation are executed sequentially (as opposed to concurrently, as before with two processes) and that a complete evaluation takes place after every single training step.
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 35932: ...
INFO:tensorflow:Saving checkpoints for 35933 into ...
INFO:tensorflow:Calling model_fn.
...
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-30-10:06:47
...
INFO:tensorflow:Restoring parameters from .../model.ckpt-35933
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [82/821]
...
INFO:tensorflow:Evaluation [738/821]
INFO:tensorflow:Evaluation [820/821]
INFO:tensorflow:Evaluation [821/821]
...
INFO:tensorflow:Finished evaluation at 2018-08-30-10:29:35
INFO:tensorflow:Saving dict for global step 35933: ...
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 35933: .../model.ckpt-35933
INFO:tensorflow:Saving checkpoints for 35934 into .../model.ckpt.
INFO:tensorflow:Calling model_fn.
...
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-08-30-10:29:56
...
INFO:tensorflow:Restoring parameters from .../model.ckpt-35934
This of course slows training down to the point that almost no progress is made. When I reduce the number of evaluation steps to 1 with model_main's command-line parameter --num_eval_steps, training is as fast as it was before (with train.py and eval.py), but the evaluation metrics become useless (e.g. the DetectionBoxes_Precision/mAP... values become constant, taking values like 1, 0 or even -1). To me it seems it is constantly computing these values for the same single image only.
So what is the right way to start model_main.py so that it makes reasonably fast progress and, at the same time, computes the evaluation metrics over the entire evaluation set?
Inside training.py there is a class EvalSpec, which is used in model_lib.py. Its constructor has a parameter called throttle_secs, which sets the interval between consecutive evaluations and defaults to 600; it never gets a different value in model_lib.py. If you have a specific value in mind, you can simply change that default, but the better practice is of course to pass it as a parameter of model_main.py, which then feeds it into EvalSpec through model_lib.py.
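To make that knob concrete, here is a minimal sketch of constructing an EvalSpec with a custom throttle_secs directly; the input_fn below is a stand-in for illustration only, not the one the Object Detection API builds:

import tensorflow as tf

def eval_input_fn():
    # Stand-in input_fn; the real one is built by the Object Detection API
    # from the eval_input_reader section of pipeline.config.
    features = {'image': tf.zeros([1, 300, 300, 3])}
    labels = tf.zeros([1, 1])
    return tf.data.Dataset.from_tensors((features, labels))

eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=None,            # None = run over the entire evaluation set
    start_delay_secs=120,  # default: delay before the very first evaluation
    throttle_secs=1800)    # re-evaluate at most once every 30 minutes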
In more detail: define it as another input flag, flags.DEFINE_integer('throttle_secs', <DEFAULT_VALUE>, 'EXPLANATION'), then pass throttle_secs=FLAGS.throttle_secs, and then change model_lib.create_train_and_eval_specs to also accept throttle_secs and, inside it, pass it on in the call to tf.estimator.EvalSpec.
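Putting those steps together, the change would look roughly like the sketch below. This is only an outline under the assumption that the rest of model_main.py and model_lib.py stays as it is; the real create_train_and_eval_specs takes more arguments than shown here.

# model_main.py (sketch): expose the evaluation interval as a flag.
from absl import flags
import tensorflow as tf

flags.DEFINE_integer('throttle_secs', 600,
                     'Minimum number of seconds between consecutive evaluations.')
FLAGS = flags.FLAGS


# model_lib.py (sketch): let create_train_and_eval_specs forward the value.
# Only the arguments relevant to this change are shown.
def create_train_and_eval_specs(train_input_fn,
                                eval_input_fns,
                                train_steps,
                                throttle_secs=600):
    train_spec = tf.estimator.TrainSpec(
        input_fn=train_input_fn, max_steps=train_steps)
    eval_specs = [
        tf.estimator.EvalSpec(
            input_fn=eval_input_fn,
            steps=None,                    # evaluate the whole evaluation set
            throttle_secs=throttle_secs)   # forwarded from the flag
        for eval_input_fn in eval_input_fns
    ]
    return train_spec, eval_specs


# Back in model_main.py's main(), pass the flag through when building the specs:
#   train_spec, eval_specs = model_lib.create_train_and_eval_specs(
#       ..., throttle_secs=FLAGS.throttle_secs)

With this in place, launching model_main.py with, for example, --throttle_secs=1800 keeps the single-process setup but lets training run for a while between full evaluations.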
EDIT: I found out that you can also set eval_interval_secs in the eval_config of the .config file. In case this works (not all options are supported since they were moved from eval.py to model_main.py), this is obviously the simpler solution. If not, use the solution above.
EDIT2: I tried using eval_interval_secs in eval_config and it didn't work, so you should use the first solution.