How to catalog datasets & models by S3 URI, but keep a local copy?

Question

I'm trying to figure out how to store intermediate Kedro pipeline objects both locally and on S3. In particular, say I have a dataset on S3:

my_big_dataset.hdf5:
  type: kedro.extras.datasets.pandas.HDFDataSet
  filepath: "s3://my_bucket/data/04_feature/my_big_dataset.hdf5"

I want to refer to these objects in the catalog by their S3 URI so that my team can use them. However, I want to avoid re-downloading the datasets, model weights, etc. every time I run a pipeline, by keeping a local copy in addition to the S3 copy. How do I mirror files with Kedro?

Recommended answer

This is a good question. Kedro has CachedDataSet for caching datasets within a single run: it keeps the dataset in memory when it is used/loaded multiple times in that run. There isn't really an equivalent that persists across runs; in general, Kedro doesn't do much persistent stuff.

That said, off the top of my head, I can think of two options that (mostly) replicate or provide this functionality:

  1. Use the same catalog in the same config environment, but with a TemplatedConfigLoader, so that the filepaths of your catalog datasets look something like:
my_dataset:
  filepath: ${base_data}/01_raw/blah.csv

and you set base_data to s3://bucket/blah when running in "production" mode and to local_filepath/data when running locally. You can decide exactly how to do this in your overridden context method, whether through local/globals.yml (see the TemplatedConfigLoader documentation), environment variables, or something else.

  2. Use separate environments, likely local (it's kind of what it was made for!), where you keep a separate copy of your catalog in which the filepaths are replaced with local ones.
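
To make the first option concrete, here is a minimal plain-Python sketch of the ${base_data} substitution. It is not Kedro's own implementation: it uses the standard library's string.Template only to illustrate the idea, and the KEDRO_MODE environment variable, the resolve_filepath helper and the mode names are all hypothetical. In a real project the value would instead be supplied to TemplatedConfigLoader, for example through globals.yml.

```python
import os
from string import Template

# Placeholder values taken from the answer; replace with your own paths.
BASE_DATA_BY_MODE = {
    "production": "s3://bucket/blah",
    "local": "local_filepath/data",
}


def resolve_filepath(template, mode=None):
    """Fill ${base_data} in a catalog filepath for the given mode."""
    # KEDRO_MODE is an invented variable name used only for this sketch.
    mode = mode or os.environ.get("KEDRO_MODE", "local")
    return Template(template).substitute(base_data=BASE_DATA_BY_MODE[mode])


if __name__ == "__main__":
    # The same catalog entry resolves to S3 or to local disk depending on mode.
    print(resolve_filepath("${base_data}/01_raw/blah.csv", mode="production"))
    print(resolve_filepath("${base_data}/01_raw/blah.csv", mode="local"))
```

The catalog entry itself never changes; only the value bound to base_data differs between a production run and a local one.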

Otherwise, your next best bet is to write a PersistentCachedDataSet, similar to CachedDataSet, which intercepts the loading/saving of the wrapped dataset and, when loading for the first time, makes a local copy in a deterministic location that it looks up on subsequent loads.
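
A rough sketch of that wrapper idea follows. It is deliberately framework-agnostic so that it runs without Kedro installed: the wrapped object is assumed to expose load() and save(data), the cache_dir default and the pickle format are arbitrary choices, and the constructor signature is hypothetical. A real Kedro version would presumably subclass Kedro's abstract dataset class and implement its load/save hooks instead.

```python
import hashlib
import pickle
from pathlib import Path


class PersistentCachedDataSet:
    """Wrap a dataset and keep a local pickle copy of what it loads or saves.

    Hypothetical sketch: `wrapped` is any object with load() and save(data);
    `remote_path` is only used to derive a stable local file name.
    """

    def __init__(self, wrapped, remote_path, cache_dir="data/.cache"):
        self._wrapped = wrapped
        # Hash the remote path so the same URI always maps to the same file.
        digest = hashlib.sha256(remote_path.encode("utf-8")).hexdigest()[:16]
        self._local_path = Path(cache_dir) / f"{digest}.pkl"

    def load(self):
        # Serve from the local copy when it exists; otherwise fetch and store.
        if self._local_path.exists():
            with self._local_path.open("rb") as f:
                return pickle.load(f)
        data = self._wrapped.load()
        self._write_local(data)
        return data

    def save(self, data):
        # Write through to the wrapped dataset, then refresh the local copy.
        self._wrapped.save(data)
        self._write_local(data)

    def _write_local(self, data):
        self._local_path.parent.mkdir(parents=True, exist_ok=True)
        with self._local_path.open("wb") as f:
            pickle.dump(data, f)
```

Note that this sketch never invalidates the local copy, so a change a teammate makes to the S3 object would not be picked up until the cached file is deleted.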


08-12 19:06