Implementing KNN imputation on categorical variables in an sklearn pipeline

Problem description

I am implementing a pre-processing pipeline using sklearn's pipeline transformers. My pipeline includes sklearn's KNNImputer estimator that I want to use to impute categorical features in my dataset. (My question is similar to this thread but it doesn't contain the answer to my question: How to implement KNN to impute categorical features in a sklearn pipeline)

I know that the categorical features have to be encoded before imputation and this is where I am having trouble. With standard label/ordinal/onehot encoders, when trying to encode categorical features with missing values (np.nan) you get the following error:

ValueError: Input contains NaN

I've managed to "by-pass" that by creating a custom encoder where I replace the np.nan with 'Missing':

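from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

class CustomEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = None

    def fit(self, X, y=None):
        # Replace NaN with a placeholder category so the encoder accepts it
        self.encoder = OrdinalEncoder()
        self.encoder.fit(X.fillna('Missing'))
        return self  # fit must return the transformer itself, not the inner encoder

    def transform(self, X, y=None):
        return self.encoder.transform(X.fillna('Missing'))

    def fit_transform(self, X, y=None, **fit_params):
        self.encoder = OrdinalEncoder()
        return self.encoder.fit_transform(X.fillna('Missing'))

# cat_features and num_features are lists of column names defined elsewhere
preprocessor = ColumnTransformer([
    ('categoricals', CustomEncoder(), cat_features),
    ('numericals', StandardScaler(), num_features)],
    remainder='passthrough'
)

pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('imputing', KNNImputer(n_neighbors=5))
])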

In this scenario however I cannot find a reasonable way to then set the encoded 'Missing' values back to np.nan before imputing with the KNNImputer.

I've read that I could do this manually using the OneHotEncoder transformer on this thread: Cyclical Loop Between OneHotEncoder and KNNImpute in Scikit-learn, but again, I'd like to implement all of this in a pipeline to automate the entire pre-processing phase.

Has anyone managed to do this? Would anyone recommend an alternative solution? Is imputing with a KNN algorithm maybe not worth the trouble and should I use a simple imputer instead?

Thanks in advance for your feedback!

Recommended answer

I am afraid that this cannot work. If you one-hot encode your categorical data, your missing values will be encoded into a new binary variable and KNNImputer will fail to deal with them because:

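it works on one column at a time, not on the full set of one-hot encoded columns
there will no longer be any missing values left for it to handle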
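
The second point can be checked with a small sketch. The toy column and the 'Missing' placeholder mirror the question's custom encoder; the variable names are illustrative only. Once NaN is replaced by a placeholder and one-hot encoded, it becomes an ordinary indicator column, so an imputer placed afterwards sees nothing to fill:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with one missing entry (illustrative data)
cats = pd.DataFrame({'A': ['x', np.nan, 'z']})

# Same trick as the question's CustomEncoder: NaN becomes the string 'Missing'
encoder = OneHotEncoder()
onehot = encoder.fit_transform(cats.fillna('Missing')).toarray()

# 'Missing' is now just another category with its own indicator column
print(encoder.categories_[0])

# No NaN survives the encoding, so a downstream KNNImputer has nothing to impute
print(np.isnan(onehot).any())
```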

Anyway, you have a couple of options for imputing missing categorical variables using scikit-learn:

You can use sklearn.impute.SimpleImputer with strategy="most_frequent": this replaces missing values with the most frequent value in each column, whether the data are strings or numeric.
Use sklearn.impute.KNNImputer, with one limitation: you first have to transform your categorical features into numeric ones while preserving the NaN values (see: LabelEncoder that keeps missing values as 'NaN'), and then run KNNImputer using only the nearest neighbour as the replacement (with more than one neighbour it will produce a meaningless average of category codes). For example:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.impute import KNNImputer

    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

    df = df.apply(lambda series: pd.Series(
        LabelEncoder().fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    ))

    imputer = KNNImputer(n_neighbors=1)
    imputer.fit_transform(df)

    In:
        A   B   C
    0   x   1   2.0
    1   NaN 6   1.0
    2   z   9   NaN

    Out:
    array([[0., 0., 1.],
           [0., 1., 0.],
           [1., 2., 0.]])

Use sklearn.impute.IterativeImputer to replicate a MissForest imputer for mixed data (but you will have to process numeric and categorical features separately). For example:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

    categorical = ['A']
    numerical = ['B', 'C']

    df[categorical] = df[categorical].apply(lambda series: pd.Series(
        LabelEncoder().fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    ))

    print(df)

    imp_num = IterativeImputer(estimator=RandomForestRegressor(),
                               initial_strategy='mean',
                               max_iter=10, random_state=0)
    imp_cat = IterativeImputer(estimator=RandomForestClassifier(),
                               initial_strategy='most_frequent',
                               max_iter=10, random_state=0)

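    df[numerical] = imp_num.fit_transform(df[numerical])
    # With a single categorical column there are no other features to learn from,
    # so this call effectively returns the initial most_frequent imputation
    df[categorical] = imp_cat.fit_transform(df[categorical])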

    print(df)
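
For completeness, the first option (SimpleImputer with strategy="most_frequent") fits into a pipeline without any custom encoder. The following is a minimal sketch with illustrative data and step names; imputing before encoding means the encoder never sees NaN.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data: one categorical and two numerical columns with gaps
df = pd.DataFrame({
    'A': ['x', np.nan, 'z', 'x'],
    'B': [1.0, 6.0, 9.0, 2.0],
    'C': [2.0, 1.0, np.nan, 3.0],
})

# Categorical branch: fill with the most frequent label, then one-hot encode
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

# Numerical branch: fill with the column mean, then standardise
numerical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

# sparse_threshold=0 forces a dense array so the result is easy to inspect
preprocessor = ColumnTransformer([
    ('categoricals', categorical_steps, ['A']),
    ('numericals', numerical_steps, ['B', 'C']),
], sparse_threshold=0)

features = preprocessor.fit_transform(df)
print(features.shape)
```

This is simpler than KNN imputation and avoids distances on category codes altogether, at the cost of ignoring any relationship between the missing value and the other columns.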
