Problem Description
I'm trying to figure out how to store intermediate Kedro pipeline objects both locally AND on S3. In particular, say I have a dataset on S3:
my_big_dataset.hdf5:
  type: kedro.extras.datasets.pandas.HDFDataSet
  filepath: "s3://my_bucket/data/04_feature/my_big_dataset.hdf5"
I want to refer to these objects in the catalog by their S3 URI so that my team can use them. HOWEVER, I want to avoid re-downloading the datasets, model weights, etc. every time I run a pipeline by keeping a local copy in addition to the S3 copy. How do I mirror files with Kedro?
Recommended Answer
This is a good question. Kedro has CachedDataSet for caching datasets within the same run: it handles keeping the dataset in memory when it's used/loaded multiple times in a single run. There isn't really an equivalent that persists across runs; in general, Kedro doesn't do much persistent caching.
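To make the distinction concrete, here is a small sketch of CachedDataSet wired up in code, assuming Kedro ~0.16 APIs; the HDF key "data" is a made-up placeholder:

from kedro.io import CachedDataSet
from kedro.extras.datasets.pandas import HDFDataSet

cached = CachedDataSet(
    dataset=HDFDataSet(
        filepath="s3://my_bucket/data/04_feature/my_big_dataset.hdf5",
        key="data",  # placeholder HDF key
    )
)

df = cached.load()        # first load in the run: downloads from S3
df_again = cached.load()  # later loads in the same run: served from memory

Note that the cache lives only for the lifetime of the run; the next run downloads again, which is exactly the problem the question is about.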
That said, off the top of my head, I can think of two options that (mostly) replicate or provide this functionality:
- Use the same catalog in the same config environment, but with the TemplatedConfigLoader, where catalog datasets' filepaths look something like:

my_dataset:
  filepath: ${base_data}/01_raw/blah.csv
and you set base_data to s3://bucket/blah when running in "production" mode and to a local filepath like data when running locally. You can decide exactly how you do this in your overridden context method (whether via local/globals.yml (see the TemplatedConfigLoader docs), environment variables, or whatnot); a sketch of this wiring follows the list.
- Use separate environments, likely local (it's kind of what it was made for!), where you keep a separate copy of your catalog in which the filepaths are replaced with local ones.
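For the first option, the wiring might look roughly like this. This is a sketch assuming a Kedro ~0.16-style project where run.py defines a ProjectContext; the hook name and import paths vary across Kedro versions, and project_name/project_version are placeholders:

from kedro.config import TemplatedConfigLoader
from kedro.framework.context import KedroContext

class ProjectContext(KedroContext):
    project_name = "my-project"    # placeholder
    project_version = "0.16.6"     # placeholder

    def _create_config_loader(self, conf_paths):
        # Fill ${base_data} from conf/<env>/globals.yml, e.g.:
        #   conf/base/globals.yml  ->  base_data: s3://bucket/blah
        #   conf/local/globals.yml ->  base_data: data
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")

For the second option there's nothing to code: keep a copy of the catalog in conf/local/catalog.yml with each S3 filepath swapped for a local one, and Kedro's local environment will shadow base when you run locally.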
Otherwise, your next best bet is to write a PersistentCachedDataSet similar to CachedDataSet, which intercepts the loading/saving of the wrapped dataset and, on the first load, makes a local copy in a deterministic location that it looks up on subsequent loads.
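Since nothing like this ships with Kedro, the sketch below is entirely hypothetical: PersistentCachedDataSet, its constructor arguments, and the wiring are made-up names. The idea is to wrap two real datasets, one remote and one local, and prefer the local copy whenever it already exists:

from kedro.io import AbstractDataSet

class PersistentCachedDataSet(AbstractDataSet):
    """Hypothetical dataset keeping a local mirror of a remote dataset across runs."""

    def __init__(self, remote_dataset, local_dataset):
        self._remote = remote_dataset  # e.g. an HDFDataSet pointing at s3://...
        self._local = local_dataset    # e.g. an HDFDataSet under data/

    def _load(self):
        if self._local.exists():       # cache hit: skip the S3 download entirely
            return self._local.load()
        data = self._remote.load()     # first load: pull from S3 once...
        self._local.save(data)         # ...and persist the deterministic local copy
        return data

    def _save(self, data):
        self._remote.save(data)        # write through to S3
        self._local.save(data)         # keep the local mirror in sync

    def _describe(self):
        return {"remote": self._remote, "local": self._local}

The "deterministic location" falls out of how you construct the local dataset, e.g. by reusing the remote object's relative path under data/.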