Problem description
I am trying to use MLflow to save a scikit-learn model, which is a pipeline containing a custom transformer I have defined, and load it in another project. My custom transformer inherits from BaseEstimator and TransformerMixin.
Let's say I have 2 projects:
- train_project: it has the custom transformer in src.ml.transformers.py
- use_project: it has other things in src, or no src directory at all
So in my train_project I do:
mlflow.sklearn.log_model(preprocess_pipe, 'model/preprocess_pipe')
and then when I try to load it into use_project:
preprocess_pipe = mlflow.sklearn.load_model(f'{ref_model_path}/preprocess_pipe')
I get the error:
[...]
File "/home/quentin/anaconda3/envs/api_env/lib/python3.7/site-packages/mlflow/sklearn.py", line 210, in _load_model_from_local_file
return pickle.load(f)
ModuleNotFoundError: No module named 'train_project'
I tried to use the format mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE:
mlflow.sklearn.log_model(preprocess_pipe, 'model/preprocess_pipe', serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_CLOUDPICKLE)
but I get the same error during loading.
I saw the code_path option in mlflow.pyfunc.log_model, but its use and purpose are not clear to me.
I thought MLflow provided an easy way to save models and serialize them so they could be used anywhere. Is that true only if you have native sklearn models (or Keras, ...)?
It seems this issue is more related to how pickle works (MLflow uses it, and pickle needs all the dependencies to be installed when loading).
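That is indeed the mechanism behind the traceback: pickle stores only the import path of a class (module + qualified name), not its body, so unpickling re-imports the defining module. A minimal stdlib-only sketch of the failure, using a fabricated in-memory module named train_project to stand in for the training project's source tree:

```python
import pickle
import sys
import types

# Fabricate a module named "train_project" at runtime and define a
# transformer-like class inside it, mimicking a class that lives in
# the training project's source tree.
train_project = types.ModuleType("train_project")

class MyTransformer:
    def transform(self, X):
        return X

MyTransformer.__module__ = "train_project"  # pretend it was defined there
train_project.MyTransformer = MyTransformer
sys.modules["train_project"] = train_project

# pickle records only the import path "train_project.MyTransformer",
# not the class body itself.
blob = pickle.dumps(MyTransformer())

# Simulate loading in another project where train_project is absent:
# unpickling tries to import the module and fails, exactly like the
# traceback above.
del sys.modules["train_project"]
try:
    pickle.loads(blob)
except ModuleNotFoundError as err:
    print(err)  # No module named 'train_project'
```

This is why switching the serialization format alone does not help: the loaded artifact still has to resolve the module path it was pickled under.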
The only solution I have found so far is to make my transformer a package and import it in both projects, saving the version of my transformer library with the conda_env argument of log_model and checking that it is the same version when I load the model into my use_project. But this is painful if I have to change my transformer or debug it...
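For reference, that workaround amounts to pinning the transformer package in the environment logged alongside the model. A sketch of the conda_env dictionary that log_model accepts; the package name my-transformers and its version are hypothetical stand-ins for wherever the custom transformer actually lives:

```python
# Hypothetical package "my-transformers" hosting the custom transformer;
# pin whatever name and version your packaged transformer really uses.
conda_env = {
    "name": "train_env",
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.7.3",
        "pip",
        {"pip": ["mlflow==1.5.0", "scikit-learn", "my-transformers==0.1.0"]},
    ],
}

# Then pass it when logging, e.g.:
# mlflow.sklearn.log_model(preprocess_pipe, "model/preprocess_pipe",
#                          conda_env=conda_env)
```

The loading project can then compare this pinned version against its own installed version before trusting the unpickled pipeline.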
Does anybody have a better, more elegant solution? Maybe there is some MLflow functionality I have missed?
Other information:
working on Linux (Ubuntu)
mlflow=1.5.0
python=3.7.3
I saw in the tests of the mlflow.sklearn API that they test with a custom transformer, but they load it in the same file, so it does not seem to resolve my issue. Maybe it can help other people, though:
https://github.com/mlflow/mlflow/blob/master/tests/sklearn/test_sklearn_model_export.py
Recommended answer
You can use the code_path parameter to save Python file dependencies (or directories containing file dependencies). These files are prepended to the system path when the model is loaded. The model folder will contain a code directory which includes all these files.