Problem description
I'm trying to train with gcloud ml-engine jobs submit training, and the job gets stuck with the following output in the logs:
My config.yaml:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  parameterServerType: large_model
  workerCount: 1
  parameterServerCount: 1
Any hints about what "grpc epoll fd: 3" means and how to fix it? My input function feeds a 16 GB TFRecord from gs://, but with batch size = 4 and shuffle buffer_size = 4. Each input sample is a single-channel 99 x 161 px image with shape (15939,), so not huge.
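As a rough sanity check on the sizes mentioned above, the per-sample and per-batch footprint can be estimated as below. This is only a sketch: it assumes each pixel is stored as a 4-byte float32, which the question does not state, and the constant names are made up for illustration.

```python
# Rough memory estimate for the input pipeline described above.
# Assumption: each pixel is a float32 (4 bytes); the question does not give the dtype.
HEIGHT, WIDTH, CHANNELS = 99, 161, 1
BYTES_PER_VALUE = 4  # float32 (assumed)
BATCH_SIZE = 4
SHUFFLE_BUFFER_SIZE = 4

# Number of values in one flattened sample; matches the reported shape (15939,).
values_per_sample = HEIGHT * WIDTH * CHANNELS
bytes_per_sample = values_per_sample * BYTES_PER_VALUE
bytes_per_batch = bytes_per_sample * BATCH_SIZE
bytes_in_shuffle_buffer = bytes_per_sample * SHUFFLE_BUFFER_SIZE

print("values per sample:", values_per_sample)
print("bytes per sample:", bytes_per_sample)
print("bytes per batch:", bytes_per_batch)
print("bytes held by shuffle buffer:", bytes_in_shuffle_buffer)
```

Under that assumption a batch is well under 1 MB, so the size of the in-memory batches is unlikely to be what stalls the job.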
Thanks.
Recommended answer
Maybe this is a bug in the Estimator implementation; I'm not sure. The solution for now is to use tf.estimator.train_and_evaluate, as suggested by @guoqing-xu:
import tensorflow as tf

# gen_input, model_fn, model_dir and FLAGS are defined elsewhere in the trainer package.
train_input_fn = gen_input(FLAGS.train_input)
eval_input_fn = gen_input(FLAGS.eval_input)

model_params = {
    'learning_rate': FLAGS.learning_rate,
}
estimator = tf.estimator.Estimator(model_dir=model_dir, model_fn=model_fn, params=model_params)

# Let train_and_evaluate drive both training and evaluation.
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, steps=None, start_delay_secs=30, throttle_secs=30)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
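For distributed jobs, train_and_evaluate picks up the cluster layout from the TF_CONFIG environment variable that the service sets on each replica. When a job appears to hang, it may help to log what each replica sees. The following is a minimal sketch using only the standard library. The helper name describe_tf_config and the host names in the example are hypothetical, and the example TF_CONFIG is an assumed shape for a job with one master, one worker and one parameter server, not output captured from a real job.

```python
import json
import os


def describe_tf_config(environ=None):
    """Summarize the cluster described by TF_CONFIG, or return None if it is unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get("TF_CONFIG")
    if not raw:
        return None
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return {
        "task_type": task.get("type"),
        "task_index": task.get("index"),
        # Number of replicas per job name, e.g. master / worker / ps.
        "replica_counts": {job: len(hosts) for job, hosts in cluster.items()},
    }


# Hypothetical TF_CONFIG for a 1 master + 1 worker + 1 ps job (host names are made up).
example_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "master": ["master-0:2222"],
            "worker": ["worker-0:2222"],
            "ps": ["ps-0:2222"],
        },
        "task": {"type": "master", "index": 0},
    })
}

summary = describe_tf_config(example_env)
print(summary)
```

Calling describe_tf_config() with no arguments at the start of the trainer reads the real environment, so each replica's role shows up in its own log stream.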