python - Using multiple custom classes in a sklearn Pipeline (Python)

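I am trying to write a tutorial on pipelines for students, but I am stuck. I am not an expert, but I am working to improve, so thank you for your patience.
Specifically, I am trying to run several steps in a pipeline to prepare a dataframe for a classifier: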

Step 1: Description of the dataframe

Step 2: Fill in the NaN values

Step 3: Convert categorical values to numbers

Here is my code:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

class Descr_df(object):

    def transform (self, X):
        print ("Structure of the data: \n {}".format(X.head(5)))
        print ("Features names: \n {}".format(X.columns))
        print ("Target: \n {}".format(X.columns[0]))
        print ("Shape of the data: \n {}".format(X.shape))

    def fit(self, X, y=None):
        return self

class Fillna(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        for column in X.columns:
            if column in non_numerics_columns:
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
            else:
                X[column] = X[column].fillna(X[column].mean())
        return X

    def fit(self, X,y=None):
        return self

class Categorical_to_numerical(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            X[column] = X[column].fillna(X[column].value_counts().idxmax())
            le.fit(X[column])
            X[column] = le.transform(X[column]).astype(int)
        return X

    def fit(self, X, y=None):
        return self

If I run steps 1 and 2, or steps 1 and 3, it works. But if I run steps 1, 2 and 3 together, I get this error:

pipeline = Pipeline([('df_intropesction', Descr_df()), ('fillna',Fillna()), ('Categorical_to_numerical', Categorical_to_numerical())])
pipeline.fit(X, y)
AttributeError: 'NoneType' object has no attribute 'columns'

Best answer

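This error occurs because, in a Pipeline, the output of the first estimator is passed to the second, the output of the second is passed to the third, and so on.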

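As the documentation of Pipeline describes, the transformers are fitted one after the other, and each one transforms the data before it reaches the next step.

So, for your pipeline, the steps are executed as follows: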

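1. Descr_df.fit(X) -> does nothing and returns self.

2. newX = Descr_df.transform(X) -> should return a value to assign to newX, which is then passed to the next estimator. Your definition returns nothing (it only prints), so None is returned implicitly.

3. Fillna.fit(newX) -> does nothing and returns self.

4. Fillna.transform(newX) -> calls newX.columns. But newX is None after step 2, hence the error.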
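The chain of calls above can be reproduced without scikit-learn. The following is only a simplified sketch of the idea, not the real Pipeline implementation: `PrintOnly`, `PassThrough`, `NeedsData` and `run_chain` are hypothetical names made up for illustration, and a plain list stands in for the dataframe (so the failure shows up as a TypeError rather than the AttributeError from the question).

```python
class PrintOnly:
    """Hypothetical stand-in for Descr_df: inspects the data but returns nothing."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("rows:", len(X))  # no return statement, so the result is None


class PassThrough:
    """Hypothetical stand-in for the fixed Descr_df: returns the data unchanged."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("rows:", len(X))
        return X


class NeedsData:
    """Hypothetical stand-in for Fillna: it has to read the data it receives."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [0 if value is None else value for value in X]


def run_chain(steps, X):
    # Simplified emulation: each step's output becomes the next step's input.
    for step in steps:
        X = step.fit(X).transform(X)
    return X


data = [1, None, 3]

# With a transform that returns its input, the chain works.
fixed_result = run_chain([PassThrough(), NeedsData()], data)

# With a print-only transform, the second step receives None and fails.
try:
    run_chain([PrintOnly(), NeedsData()], data)
    broken_error = None
except TypeError as exc:
    broken_error = exc
```

The point of the sketch is that the failure is raised by the second step, although the cause is the missing return in the first one.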

Solution: change the transform method of Descr_df so that it returns the dataframe as is:

def transform (self, X):
    print ("Structure of the data: \n {}".format(X.head(5)))
    print ("Features names: \n {}".format(X.columns))
    print ("Target: \n {}".format(X.columns[0]))
    print ("Shape of the data: \n {}".format(X.shape))
    return X

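Suggestion: make your classes inherit from the BaseEstimator and TransformerMixin classes in scikit-learn. Conforming to its conventions is good practice.

That is, change class Descr_df(object) to class Descr_df(BaseEstimator, TransformerMixin), Fillna(object) to Fillna(BaseEstimator, TransformerMixin), and so on.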
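As a rough illustration of that suggestion, here is a minimal sketch with two of the steps rewritten on top of BaseEstimator and TransformerMixin. The class names `DescribeFrame` and `FillMissing` and the toy dataframe are assumptions made for this example, not the asker's code. The sketch also works on a copy of the input and uses `select_dtypes` instead of the private `_get_numeric_data`, which is a design choice of this example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DescribeFrame(BaseEstimator, TransformerMixin):
    """Prints a short description and passes the dataframe on unchanged."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Shape of the data: {}".format(X.shape))
        return X  # returning X is what keeps the pipeline chain intact


class FillMissing(BaseEstimator, TransformerMixin):
    """Fills NaN with the mean (numeric) or the most frequent value (other)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # avoid modifying the caller's dataframe
        numeric_columns = X.select_dtypes(include="number").columns
        for column in X.columns:
            if column in numeric_columns:
                X[column] = X[column].fillna(X[column].mean())
            else:
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
        return X


# Toy data, invented for this example.
toy = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["Paris", None, "Paris"]})

pipeline = Pipeline([("describe", DescribeFrame()), ("fillna", FillMissing())])
result = pipeline.fit_transform(toy)
```

TransformerMixin supplies `fit_transform`, so each class only needs `fit` and `transform`. BaseEstimator adds `get_params` and `set_params`, which tools such as grid search rely on.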

See the following example for more details on custom classes in a Pipeline:

http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py

For more on this topic (python - using multiple custom classes in a sklearn Pipeline), see the corresponding question on Stack Overflow: https://stackoverflow.com/questions/43499342/