This post covers a Google Storage (gs://) wrapper for file input/output in Cloud ML, i.e. how to make ordinary file I/O work against gs:// paths.

Problem description

Google recently announced Cloud ML (https://cloud.google.com/ml/), and it's very useful. However, one limitation is that the input/output of a TensorFlow program must support gs://.

If we use only TensorFlow APIs to read/write files, it should be OK, since these APIs support gs://.
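For instance, a minimal sketch of reading and writing through TensorFlow's file_io module, which accepts local paths and gs:// URLs alike (the bucket path below is hypothetical):

from tensorflow.python.lib.io import file_io

# file_io.FileIO behaves like a file object but understands gs:// paths.
with file_io.FileIO('gs://my-bucket/vocab.pickled', mode='rb') as f:
    data = f.read()

with file_io.FileIO('gs://my-bucket/notes.txt', mode='w') as f:
    f.write('some output')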

However, if we use native file I/O APIs such as open, it does not work, because they don't understand gs://.

For example:

with open(vocab_file, 'wb') as f:
    cPickle.dump(self.words, f)

This code won't work in Google Cloud ML.

However, modifying every native file I/O call to use TensorFlow APIs or the Google Storage Python APIs is really tedious. Is there any simple way to do this? Are there any wrappers that support Google Storage (gs://) on top of native file I/O?

As suggested in Pickled scipy sparse matrix as input data?, perhaps we can use file_io.read_file_to_string('gs://...'), but this still requires significant code modification.
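For concreteness, a minimal sketch of that approach (the bucket path is hypothetical); every open/load and open/dump pair in the program would have to be rewritten in this style, which is why the modification is significant:

from tensorflow.python.lib.io import file_io
import cPickle

# Read: replaces open(path, 'rb') followed by cPickle.load(f)
words = cPickle.loads(file_io.read_file_to_string('gs://my-bucket/vocab.pickled'))

# Write: replaces open(path, 'wb') followed by cPickle.dump(words, f)
file_io.write_string_to_file('gs://my-bucket/vocab.pickled', cPickle.dumps(words))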

Recommended answer

One solution is to copy all of the data to local disk when the program starts up. You can do that using gsutil inside the Python script that gets run, something like:

import os
import subprocess
import cPickle

vocab_file = 'vocab.pickled'

# Download the input data from GCS to local disk at startup.
subprocess.check_call(['gsutil', '-m', 'cp', '-r',
                       os.path.join('gs://path/to/', vocab_file), '/tmp'])

# Native file I/O now works, because the file is local.
with open(os.path.join('/tmp', vocab_file), 'rb') as f:
    self.words = cPickle.load(f)

And if you have any outputs, you can write them to local disk and gsutil rsync them. (But be careful to handle restarts correctly, because you may be put on a different machine.)
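A minimal sketch of that output pattern (the local directory and bucket path are hypothetical):

import subprocess

# Write all outputs under a local directory during the job...
local_out = '/tmp/outputs'

# ...then periodically, and again at shutdown, mirror it to GCS.
subprocess.check_call(['gsutil', '-m', 'rsync', '-r',
                       local_out, 'gs://path/to/outputs'])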

The other solution is to monkey-patch open (note: untested):

import __builtin__
from tensorflow.python.lib.io import file_io

# NB: not all modes are compatible; should handle more carefully.
# Probably should be reported on
# https://github.com/tensorflow/tensorflow/issues/4357
def new_open(name, mode='r', buffering=-1):
  return file_io.FileIO(name, mode)

__builtin__.open = new_open

Just be sure to do that before any module actually tries to read from GCS.
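For example, the patch could live in its own module that the entry point imports before anything else (the module names here are hypothetical):

# main.py
import gcs_open_patch  # hypothetical module containing the monkey patch above

import trainer  # trainer's plain open() calls can now take gs:// paths
trainer.run()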

