Solved: No space left on device in the Google Cloud ML BASIC tier. What is the disk size of each tier in Cloud ML?

Problem Description

While training my model on more than 20 GB of data in the BASIC tier of Cloud ML, my jobs fail because there is no disk space left on the Cloud ML machines, and I cannot find any details about per-tier disk size in the Cloud ML Engine documentation [https://cloud.google.com/ml-engine/docs/tensorflow/machine-types].

I need help choosing the tier for my training jobs; in addition, the utilisation shown in the Job Details graphs is very low.
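For reference, the tier is chosen with the `--scale-tier` flag when the job is submitted; a minimal sketch, in which the job name, region, bucket, and package paths are placeholder assumptions:

```shell
# Submit a training job on a specific scale tier.
# Valid tiers include BASIC, STANDARD_1, PREMIUM_1, BASIC_GPU, and CUSTOM.
gcloud ml-engine jobs submit training my_training_job \
  --scale-tier BASIC \
  --region us-central1 \
  --module-name trainer.task \
  --package-path ./trainer \
  --job-dir gs://my-bucket/output
```

With `--scale-tier CUSTOM`, a config file can additionally specify the machine type for the master and workers.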

{
  "insertId": "1klpt2",
  "jsonPayload": {
    "created": 1554434546.3576794,
    "levelname": "ERROR",
    "lineno": 51,
    "message": "Failed to train : [Errno 28] No space left on device",
    "pathname": "/root/.local/lib/python3.5/site-packages/loggerwrapper.py"
  },
  "labels": {
    "compute.googleapis.com/resource_id": "",
    "compute.googleapis.com/resource_name": "cmle-training-10361805218452604847",
    "compute.googleapis.com/zone": "",
    "ml.googleapis.com/job_id/log_area": "root",
    "ml.googleapis.com/trial_id": ""
  },
  "logName": "projects/backend/logs/master-replica-0",
  "receiveTimestamp": "2019-03-31T12:32:30.07683Z",
  "resource": {
    "labels": {
      "job_id": "",
      "project_id": "backend",
      "task_name": "master-replica-0"
    },
    "type": "ml_job"
  },
  "severity": "ERROR",
  "timestamp": "2019-03-31T12:32:26.357679367Z"
}

Recommended Answer

Solved: this error was not caused by a lack of disk storage but by the shared-memory tmpfs. The sklearn fit was consuming all of the shared memory during training. Solution: setting the JOBLIB_TEMP_FOLDER environment variable to /tmp solved the problem.
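A minimal sketch of the fix, assuming the variable is exported in the training environment before the job starts (it can equally be set from the trainer code itself via `os.environ` before importing sklearn):

```shell
# joblib (used by sklearn for parallel fits) memory-maps large arrays
# into its temp folder, which defaults to the shared-memory tmpfs
# (/dev/shm) when available. Pointing it at /tmp moves those files
# onto the regular disk instead.
export JOBLIB_TEMP_FOLDER=/tmp
```

Because the shared-memory filesystem is typically much smaller than the boot disk, redirecting joblib's memmapped temporaries to `/tmp` avoids the `[Errno 28] No space left on device` failure without changing the tier.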

