This article explains how to read a blob (pickle) file from GCS in a Google Cloud Dataflow job.
Problem Description
I am trying to run a Dataflow pipeline remotely that will use a pickle file. Locally, I can use the code below to load the file.
with open(known_args.file_path, 'rb') as fp:
    file = pickle.load(fp)
However, it does not work when the path points to Cloud Storage (gs://...):
IOError: [Errno 2] No such file or directory: 'gs://.../.pkl'
I roughly understand why it is not working, but I cannot find the right way to do it.
Recommended Answer
If you have pickle files in your GCS bucket, you can load them as blobs and then process them further in your code (using pickle.load()):
import apache_beam as beam


class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
        # Import inside process() so the dependency is resolved on Dataflow workers
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
        # Emit the GCS path together with the raw bytes of the object
        yield (element, gcs.open(element).read())

# usage example:
files = (p
         | "Initialize" >> beam.Create(["gs://your-bucket-name/pickle_file_path.pickle"])
         | "Read blobs" >> beam.ParDo(ReadGcsBlobs())
        )
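To actually deserialize the blobs downstream, the raw bytes emitted above can be passed to pickle.loads() in a follow-up transform. A minimal sketch, assuming the blobs are ordinary pickled Python objects; the unpickle_blob helper name is made up for illustration:

import pickle

def unpickle_blob(path_and_bytes):
    # Hypothetical helper: turn the (path, raw_bytes) pair emitted by ReadGcsBlobs
    # back into a (path, Python object) pair
    path, raw_bytes = path_and_bytes
    return (path, pickle.loads(raw_bytes))

# usage example (continuing the pipeline above):
objects = files | "Unpickle" >> beam.Map(unpickle_blob)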