I would like to know whether it is possible to use MultiLabelBinarizer inside a ColumnTransformer.

I have a toy pandas DataFrame, for example:

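import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"id": [1, 2, 3],
                   "text": ["some text", "some other text", "yet another text"],
                   "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]]})

preprocess = ColumnTransformer(
    [
        ('vectorizer', CountVectorizer(), 'text'),
        ('binarizer', MultiLabelBinarizer(), ['label']),
    ],
    remainder='drop')

# Fitting the transformer is the step that triggers the error
preprocess.fit_transform(df)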

However, this code raises an exception:
~/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    714     with _print_elapsed_time(message_clsname, message):
    715         if hasattr(transformer, 'fit_transform'):
--> 716             res = transformer.fit_transform(X, y, **fit_params)
    717         else:
    718             res = transformer.fit(X, y, **fit_params).transform(X)

TypeError: fit_transform() takes 2 positional arguments but 3 were given

With OneHotEncoder, the ColumnTransformer works fine.

Best answer

I did not test extensively to work out exactly why the following works, but I was able to build a custom transformer that essentially "wraps" MultiLabelBinarizer while also being compatible with ColumnTransformer:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer


class MultiLabelBinarizerFixedTransformer(BaseEstimator, TransformerMixin):
    """
    Wraps `MultiLabelBinarizer` in a form that can work with `ColumnTransformer`
    """
    def __init__(self):
        self.feature_name = ["mlb"]
        self.mlb = MultiLabelBinarizer(sparse_output=False)

    def fit(self, X, y=None):
        # y is accepted and ignored so that fit(X, y) calls succeed
        self.mlb.fit(X)
        return self

    def transform(self, X):
        return self.mlb.transform(X)

    def get_feature_names(self, input_features=None):
        # One output feature per class learned by the wrapped binarizer
        cats = self.mlb.classes_
        if input_features is None:
            input_features = ['x%d' % i for i in range(len(cats))]
        elif len(input_features) != len(cats):
            raise ValueError(
                "input_features should have length equal to number of "
                "features ({}), got {}".format(len(cats),
                                               len(input_features)))

        feature_names = [f"{input_features[i]}_{cats[i]}" for i in range(len(cats))]
        return np.array(feature_names, dtype=object)
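For illustration, here is a minimal sketch of how such a wrapper might be plugged into the ColumnTransformer from the question. The class name `MultiLabelBinarizerWrapper` is hypothetical and is a trimmed-down stand-in for the class above, repeated only so the snippet runs on its own. One assumption worth calling out: the column is selected with the scalar name `'label'` rather than the list `['label']`, so that the wrapper receives a 1-D Series of label lists instead of a one-column DataFrame.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer


class MultiLabelBinarizerWrapper(TransformerMixin, BaseEstimator):
    """Hypothetical compact wrapper exposing fit(X, y=None) and transform(X)."""

    def fit(self, X, y=None):
        # y is ignored; it exists only so that fit(X, y) calls succeed
        self.mlb_ = MultiLabelBinarizer(sparse_output=False)
        self.mlb_.fit(X)
        return self

    def transform(self, X):
        return self.mlb_.transform(X)


df = pd.DataFrame({"id": [1, 2, 3],
                   "text": ["some text", "some other text", "yet another text"],
                   "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]]})

preprocess = ColumnTransformer(
    [
        ("vectorizer", CountVectorizer(), "text"),
        # Scalar column name: the wrapper gets a Series whose items are lists
        ("binarizer", MultiLabelBinarizerWrapper(), "label"),
    ],
    remainder="drop")

features = preprocess.fit_transform(df)
# ColumnTransformer may return a sparse matrix; densify for inspection
if hasattr(features, "toarray"):
    features = features.toarray()

label_classes = list(preprocess.named_transformers_["binarizer"].mlb_.classes_)
print(features.shape)
print(label_classes)
```

The inherited `TransformerMixin.fit_transform(X, y=None)` accepts the extra `y` argument, which appears to be what lets this version pass through the pipeline machinery.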

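My hunch is that MultiLabelBinarizer's transform() takes a different set of inputs than ColumnTransformer expects. The traceback is consistent with this: the pipeline calls fit_transform(X, y), while MultiLabelBinarizer.fit_transform accepts only a single argument.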
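The mismatch can be checked outside of ColumnTransformer. The sketch below calls both encoders the way the traceback shows the pipeline calling them, with `X` and `y` passed positionally. The sample data and the variable names (`mlb_accepts_xy`, `ohe_result`) are made up for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder

labels = [["white", "cat"], ["black", "cat"], ["brown", "dog"]]

# Mimic the pipeline call: transformer.fit_transform(X, y)
try:
    MultiLabelBinarizer().fit_transform(labels, None)
    mlb_accepts_xy = True
except TypeError as exc:
    # MultiLabelBinarizer.fit_transform takes only the label data
    mlb_accepts_xy = False
    print("MultiLabelBinarizer:", exc)

# OneHotEncoder follows the usual fit_transform(X, y=None) convention
ohe_result = OneHotEncoder().fit_transform([["a"], ["b"], ["a"]], None)
print("OneHotEncoder output shape:", ohe_result.shape)
```

This suggests the failure comes from the method signature rather than from the label data itself, which is why a thin wrapper with a `fit(X, y=None)` method is enough to work around it.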

Regarding "python - sklearn ColumnTransformer with MultilabelBinarizer", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59254662/