How to read blob (pickle) files from GCS in a Google Cloud DataFlow job?
I am trying to run a DataFlow pipeline remotely that uses a pickle file. Locally, I can load the file with the code below:
import pickle

with open(known_args.file_path, 'rb') as fp:
    file = pickle.load(fp)
However, this does not work when the path points to Cloud Storage (gs://...):
IOError: [Errno 2] No such file or directory: 'gs://.../.pkl'
I kind of understand why it is not working but I cannot find the right way to do it.
If you have pickle files in your GCS bucket, you can load them as blobs and then process them further as in your code (using pickle.load()):
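import apache_beam as beam

class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
        # Imported inside process so the module is resolved on the remote workers.
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
        # Emit (path, raw bytes) for each gs:// path received as input.
        yield (element, gcs.open(element).read())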
# usage example:
files = (p
         | "Initialize" >> beam.Create(["gs://your-bucket-name/pickle_file_path.pickle"])
         | "Read blobs" >> beam.ParDo(ReadGcsBlobs())
        )
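The answer stops at reading the raw bytes, so here is a minimal sketch of the follow-up unpickling step. It assumes each element is a (path, bytes) pair like the ones ReadGcsBlobs yields. The helper name unpickle_blob and the sample dictionary are hypothetical and only serve as illustration. Because the blob is already in memory as bytes, the sketch uses pickle.loads rather than pickle.load, which expects a file object.

```python
import pickle


def unpickle_blob(element):
    """Turn a (path, raw_bytes) pair into a (path, object) pair."""
    path, raw_bytes = element
    # pickle.loads deserializes from bytes held in memory.
    return path, pickle.loads(raw_bytes)


# Local check: an in-memory blob stands in for the bytes read from GCS.
sample = (
    "gs://your-bucket-name/pickle_file_path.pickle",
    pickle.dumps({"weights": [1, 2, 3]}),
)
path, obj = unpickle_blob(sample)
print(path, obj)
```

In a pipeline, this could presumably be chained after the read step with something like `| "Unpickle" >> beam.Map(unpickle_blob)`.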