How to catalog datasets & models by S3 URI, but keep a local copy?


Question

I'm trying to figure out how to store intermediate Kedro pipeline objects both locally AND on S3. In particular, say I have a dataset on S3:

my_big_dataset.hdf5:
  type: kedro.extras.datasets.pandas.HDFDataSet
  filepath: "s3://my_bucket/data/04_feature/my_big_dataset.hdf5"

I want to refer to these objects in the catalog by their S3 URI so that my team can use them. HOWEVER, I want to avoid re-downloading the datasets, model weights, etc. every time I run a pipeline, by keeping a local copy in addition to the S3 copy. How do I mirror files with Kedro?

Recommended answer

This is a good question. Kedro has CachedDataSet for caching datasets within a single run: it keeps the dataset in memory when it is used/loaded multiple times in that run. There isn't really an equivalent that persists across runs; in general, Kedro doesn't do much persistent stuff.

That said, off the top of my head, I can think of two options that (mostly) replicate or give this functionality:

Use the same catalog in the same config environment, but with TemplatedConfigLoader, where your catalog datasets have filepaths that look something like:

my_dataset:
  filepath: ${base_data}/01_raw/blah.csv

and you set base_data to s3://bucket/blah when running in "production" mode and to local_filepath/data when running locally. You can decide exactly how to do this in your overridden context method, whether that's using local/globals.yml (see the linked documentation above), environment variables, or something else.

Use separate environments, likely local (it's kind of what it was made for!), where you keep a separate copy of your catalog in which the filepaths are replaced with local ones.
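To make the first option a bit more concrete, here is a minimal sketch of the substitution idea only. It does not use Kedro's TemplatedConfigLoader; it just mimics the `${base_data}` replacement with the standard library. The `RUN_MODE` environment variable, the `BASE_DATA` mapping, and the `resolve_catalog` helper are made-up names for illustration.

```python
import os
from string import Template

# Catalog entry with the ${base_data} placeholder from the example above.
CATALOG_TEMPLATE = {"my_dataset": {"filepath": "${base_data}/01_raw/blah.csv"}}

# Hypothetical mapping from run mode to base path (values taken from the answer).
BASE_DATA = {"local": "local_filepath/data", "production": "s3://bucket/blah"}


def resolve_catalog(mode):
    """Return a copy of the catalog with ${base_data} filled in for the given mode."""
    base = BASE_DATA[mode]
    return {
        name: {key: Template(value).substitute(base_data=base) for key, value in entry.items()}
        for name, entry in CATALOG_TEMPLATE.items()
    }


# RUN_MODE is an assumed environment variable name, not a Kedro setting.
mode = os.environ.get("RUN_MODE", "local")
print(resolve_catalog(mode)["my_dataset"]["filepath"])
```

In a real project the same choice would be made through the config loader (for example via a globals file per environment) rather than by hand, but the effect on the catalog filepaths is the same.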

Otherwise, your next best bet is to write a PersistentCachedDataSet, similar to CachedDataSet, which intercepts loading/saving for the wrapped dataset and, when loading for the first time, makes a local copy in a deterministic location that you look up on subsequent loads.
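As a rough sketch of what such a wrapper could look like: the class below is hypothetical and deliberately does not subclass anything from Kedro. It assumes only that the wrapped object exposes `load()` and `save(data)`, that the data can be pickled, and that a `cache_dir` argument (an invented name) says where local copies go. A real version would likely subclass Kedro's AbstractDataSet and would need some way to invalidate the cache when the S3 copy changes.

```python
import hashlib
import pickle
from pathlib import Path


class PersistentCachedDataSet:
    """Hypothetical wrapper: keeps a local pickle copy of a wrapped dataset across runs."""

    def __init__(self, dataset, remote_path, cache_dir=".local_cache"):
        self._dataset = dataset
        # Deterministic local location derived from the remote path.
        digest = hashlib.sha256(remote_path.encode("utf-8")).hexdigest()
        self._cache_file = Path(cache_dir) / f"{digest}.pkl"

    def load(self):
        # Later loads: read the local copy instead of hitting the remote store.
        if self._cache_file.exists():
            with self._cache_file.open("rb") as f:
                return pickle.load(f)
        # First load: fetch from the wrapped dataset, then keep a local copy.
        data = self._dataset.load()
        self._write_cache(data)
        return data

    def save(self, data):
        # Save to the wrapped dataset and refresh the local copy.
        self._dataset.save(data)
        self._write_cache(data)

    def _write_cache(self, data):
        self._cache_file.parent.mkdir(parents=True, exist_ok=True)
        with self._cache_file.open("wb") as f:
            pickle.dump(data, f)


class FakeRemoteDataSet:
    """Stand-in for an S3-backed dataset; counts how often load() is called."""

    def __init__(self, data):
        self._data = data
        self.load_calls = 0

    def load(self):
        self.load_calls += 1
        return self._data

    def save(self, data):
        self._data = data
```

Wrapping a `FakeRemoteDataSet` and calling `load()` twice should hit the "remote" only once; the second call reads the pickle file. Note that this sketch never checks whether the remote copy has changed, so teammates' updates on S3 would not be picked up until the local file is deleted.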
