python - Packaging MultiLabelBinarizer into a scikit-learn Pipeline for inference on new data

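I am building a multilabel classifier that predicts labels from a text field, for example predicting genres from a movie title. I want to use MultiLabelBinarizer() to binarize the column that holds all applicable genre labels. For example, ['action','comedy','drama'] becomes three columns, each holding a 0/1 value.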

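The reason I use MultiLabelBinarizer() is so that I can use its built-in inverse_transform() function to convert an output array (for example, array([0, 0, 1, 0, 1])) directly into user-friendly text output (['action','drama']).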
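To make that round trip concrete, here is a minimal sketch on toy data. The genre lists below are made up for illustration and are not taken from the real dataset. Note that inverse_transform() returns one tuple of labels per input row.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre lists standing in for df['Genres']
genres = [["action", "comedy", "drama"], ["drama"], ["action", "thriller"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres)  # one 0/1 column per distinct label

# Columns follow the sorted label order stored in mlb.classes_
print(list(mlb.classes_))
print(y)

# Map a binary prediction row back to its label names
decoded = mlb.inverse_transform(np.array([[1, 0, 1, 0]]))
print(decoded)
```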

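The classifier works, but I run into a problem when predicting on new data. I cannot find a way to integrate MultiLabelBinarizer() into the Pipeline so that it can be saved and reloaded for inference on new data. One solution is to save it separately as a pickle object and reload it every time, but I would like to avoid that kind of dependency in production.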

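I know this is similar to the tf-idf vectorizer I have built into the Pipeline, but it differs in that it is applied to the target column (the genre labels) rather than to my independent variable (the text comments). Here is the code I use to train the multilabel SVM: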

def svm_train(df):
  mlb = MultiLabelBinarizer()
  y = mlb.fit_transform(df['Genres'])

  with mlflow.start_run():
    x_train, x_test, y_train, y_test = train_test_split(df['Movie Title'], y, test_size=0.3)

    # Instantiate TF-IDF Vectorizer and SVM Model
    tfidf_vect = TfidfVectorizer()
    mdl = OneVsRestClassifier(LinearSVC(loss='hinge'))
    svm_pipeline = Pipeline([('tfidf', tfidf_vect), ('clf', mdl)])

    svm_pipeline.fit(x_train, y_train)
    prediction = svm_pipeline.predict(x_test)

    report = classification_report(y_test, prediction, target_names=mlb.classes_)

    mlflow.sklearn.log_model(svm_pipeline, "Multilabel Classifier")
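    # mlflow.log_artifact() takes a local file path, so pickle mlb to disk first
    import pickle
    with open("mlb.pkl", "wb") as f:
      pickle.dump(mlb, f)
    mlflow.log_artifact("mlb.pkl", "MLB")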

  return report

svm_train(df)

Inference involves reloading the saved model from MLflow in a separate Databricks notebook (the same as reloading a pickle file) and using the pipeline to predict:

def predict_labels(new_data):
  model_uri = '...MLflow path...'
  model = mlflow.sklearn.load_model(model_uri)
  predictions = model.predict(new_data)
  # If I can't package the MultiLabelBinarizer() into the Pipeline, this
  # is where I'd have to load the pickle object mlb
  # so that I can inverse_transform()
  return mlb.inverse_transform(predictions)

new_data = ['Some movie title']
predict_labels(new_data)

['action','comedy']

These are all the libraries I am using:

import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import glob, os
from pyspark.sql import DataFrame
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score

Best answer

For your use case, you may want to consider MLflow's functionality for persisting custom models. According to the docs:

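While MLflow's built-in model persistence utilities are convenient for packaging models from various popular ML libraries in MLflow Model format, they do not cover every use case. For example, you may want to use a model from an ML library that is not explicitly supported by MLflow's built-in flavors. Alternatively, you may want to package custom inference code and data to create an MLflow Model. Fortunately, MLflow provides two solutions that can be used to accomplish these tasks: Custom Python Models and Custom Flavors.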

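In particular, you should be able to log the MultiLabelBinarizer as an artifact alongside the sklearn model, in much the same way as the XGBoost model in the linked example, and then load it back at prediction time, for example: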

# Save sklearn model & multilabel indexer to paths on the local filesystem
sklearn_model_path = "some/local/path"
labelindexer_path = "another/local/path"
# ... save your models objects here to sklearn_model_path and labelindexer_path

# Define the custom model class
import mlflow.pyfunc
class SklearnWrapper(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        import pickle
        import mlflow.sklearn
        # context.artifacts maps each artifact name to its resolved local path
        with open(context.artifacts["indexer_path"], 'rb') as handle:
            self.indexer = pickle.load(handle)
        self.pipeline = mlflow.sklearn.load_model(context.artifacts["pipeline_path"])

    def predict(self, context, model_input):
        # Predict the binary indicator matrix, then map it back to label tuples
        pipeline_preds = self.pipeline.predict(model_input)
        return self.indexer.inverse_transform(pipeline_preds)

# Create a Conda environment for the new MLflow Model that contains the scikit-learn
# library as a dependency, as well as the required CloudPickle library
import cloudpickle
import sklearn
conda_env = {
    'channels': ['defaults'],
    'dependencies': [
      # The conda package is named scikit-learn, not sklearn
      'scikit-learn={}'.format(sklearn.__version__),
      'cloudpickle={}'.format(cloudpickle.__version__),
    ],
    'name': 'sklearn_env'
}

# Save the MLflow Model
artifacts = {
    "pipeline_path": sklearn_model_path,
    "indexer_path": labelindexer_path,
}
mlflow_pyfunc_model_path = "sklearn_mlflow_pyfunc"
mlflow.pyfunc.save_model(
        path=mlflow_pyfunc_model_path, python_model=SklearnWrapper(), artifacts=artifacts,
        conda_env=conda_env)

# Load the model in `python_function` format
loaded_model = mlflow.pyfunc.load_model(mlflow_pyfunc_model_path)
# Predict on a pandas DataFrame
import pandas as pd
loaded_model.predict(pd.DataFrame(...))

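Note that our custom model still loads the MultiLabelBinarizer back in, but MLflow persists the binarizer together with your pipeline and the custom model logic, so you can treat the model as a single coherent unit for production deployment.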
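As an illustration of the binarizer half of that arrangement, the following sketch pickles a fitted MultiLabelBinarizer to a local file and reads it back the same way load_context does. The toy labels, the temporary directory, and the file name mlb.pkl are assumptions made for the example. The pipeline itself would be written to its own path separately, for instance with mlflow.sklearn.save_model, which is not shown here.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical labels standing in for df['Genres']
genres = [["action", "comedy"], ["drama"], ["action", "drama"]]
mlb = MultiLabelBinarizer().fit(genres)

# Write the fitted binarizer to a local file; a path like this is what
# would be registered under the "indexer_path" artifact key
tmp_dir = tempfile.mkdtemp()
labelindexer_path = os.path.join(tmp_dir, "mlb.pkl")
with open(labelindexer_path, "wb") as handle:
    pickle.dump(mlb, handle)

# Mirror load_context: read the binarizer back from the artifact path
with open(labelindexer_path, "rb") as handle:
    restored = pickle.load(handle)

# Stand-in for the indicator matrix returned by pipeline.predict(...)
predictions = np.array([[1, 0, 1]])
labels = restored.inverse_transform(predictions)
print(labels)
```

Because the restored object carries the fitted classes_, the decoding step needs nothing from the training notebook beyond the artifact file.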

Regarding "python - Packaging MultiLabelBinarizer into a scikit-learn Pipeline for inference on new data", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57924929/