Problem description
I launched a TensorFlow job on ML Engine, and after about 2 minutes I keep getting the error message "The replica master 0 exited with a non-zero status of 1."
(Incidentally, the job runs fine with ml-engine local.)
Question: Is there any place or log file where I can see further information on what happened?
The logs viewer shows only the following entry:
{
insertId: "ibal72g1rxhr63"
logName: "projects/**-***-ml/logs/ml.googleapis.com%2Fcnn180322_170649"
receiveTimestamp: "2018-03-22T17:08:38.344282172Z"
resource: {
labels: {
job_id: "cnn180322_170649"
project_id: "**-***-ml"
task_name: "service"
}
type: "ml_job"
}
severity: "ERROR"
textPayload: "The replica master 0 exited with a non-zero status of 1."
timestamp: "2018-03-22T17:08:38.344282172Z"
}
Thanks in advance for any pointers!
Recommended answer
The apparent lack of log files turned out to be caused by a missing permission to write to the logs.
Under IAM & admin, adding the Logs Writer role to the account cloud-ml-service@<project_id>.iam.gserviceaccount.com solved the problem: the master and workers can now write log messages to Stackdriver as expected.
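The same role can presumably be granted from the command line instead of the console. The following is a minimal sketch, not a verified procedure: it only builds the gcloud projects add-iam-policy-binding command for the account named above. The helper name logs_writer_binding_command and the project ID "my-project" are placeholders introduced here, and the account name pattern is taken from this answer, so check it against what IAM & admin actually lists for your project.

```python
import shlex
from typing import List


def logs_writer_binding_command(project_id: str) -> List[str]:
    """Build a gcloud command granting Logs Writer to the Cloud ML service account.

    The account name follows the pattern quoted in the answer; it is an
    assumption that your project uses the same pattern.
    """
    member = f"serviceAccount:cloud-ml-service@{project_id}.iam.gserviceaccount.com"
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        "--member", member,
        # roles/logging.logWriter is the role shown as "Logs Writer" in the console.
        "--role", "roles/logging.logWriter",
    ]


if __name__ == "__main__":
    # "my-project" is a placeholder project ID; the command is printed, not executed.
    print(shlex.join(logs_writer_binding_command("my-project")))
```

Printing the command rather than running it lets you review the member and role before changing the project's IAM policy.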
For a similar discussion and some additional information, see "Stackdriver logs not available for Cloud ML jobs since migration to V2".
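Once the master and workers are able to write to Stackdriver, their entries can be narrowed down with a filter on the fields that appear in the log entry from the question (resource type ml_job and the job_id label). The sketch below only assembles such a filter string; the helper name ml_job_log_filter and the optional severity argument are illustrative additions, not part of the original answer.

```python
from typing import Optional


def ml_job_log_filter(job_id: str, min_severity: Optional[str] = None) -> str:
    """Build a Stackdriver Logging filter for a single ML Engine job."""
    parts = [
        'resource.type="ml_job"',
        f'resource.labels.job_id="{job_id}"',
    ]
    if min_severity is not None:
        # Keep only entries at or above the given severity, e.g. "ERROR".
        parts.append(f"severity>={min_severity}")
    return " AND ".join(parts)


if __name__ == "__main__":
    # Job ID taken from the log entry in the question.
    print(ml_job_log_filter("cnn180322_170649", min_severity="ERROR"))
```

The resulting string can be pasted into the logs viewer's advanced filter, or passed as the filter argument to gcloud logging read.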
Thanks everyone for the input!