python - Transforming text data in an sklearn pipeline

Given a series of text data,

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])

I would like to use an sklearn pipeline to produce something like

np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]])

My initial attempt

pipe = Pipeline([
    ('encoder', LabelEncoder()),
    ('hot', OneHotEncoder(sparse=False))])
print(pipe.fit_transform(X))

raises TypeError: fit_transform() takes exactly 2 arguments (3 given), as described in this issue. I tried editing the signature on LabelEncoder so that SaneLabelEncoder().fit_transform(X) gives [0 2 1 0 1 2], but

pipe = Pipeline([
    ('encoder', SaneLabelEncoder()),
    ('hot', OneHotEncoder(sparse=False))])
print(pipe.fit_transform(X))

gives [[ 1. 1. 1. 1. 1. 1.]]. Any suggestions on how to get the desired output?

Best answer

Use LabelBinarizer:

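import numpy as np
from sklearn import preprocessing

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])

# LabelBinarizer one-hot encodes the labels directly;
# the columns follow the sorted class order (cat, cow, dog).
binar = preprocessing.LabelBinarizer()
X_bin = binar.fit_transform(X)
print(X_bin)

The output is: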

[[1 0 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]
 [0 1 0]
 [0 0 1]]
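
The snippet above calls LabelBinarizer on its own rather than inside a Pipeline. Since its fit_transform accepts only one argument, placing it in a Pipeline may hit the same TypeError as LabelEncoder. Below is a minimal sketch of one way to keep everything in a pipeline, assuming scikit-learn 0.20 or later, where OneHotEncoder accepts string categories directly. The helper name to_column and the step name 'reshape' are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])


def to_column(values):
    # Hypothetical helper: OneHotEncoder expects a 2D array of shape
    # (n_samples, n_features), so turn the 1D series into one column.
    return np.asarray(values).reshape(-1, 1)


pipe = Pipeline([
    # validate=False so the 1D string array is passed through unchanged.
    ('reshape', FunctionTransformer(to_column, validate=False)),
    # Default output is a sparse matrix; it is densified below.
    ('hot', OneHotEncoder()),
])

# Convert the sparse result to a dense array for display.
X_hot = pipe.fit_transform(X).toarray()
print(X_hot)
```

As with LabelBinarizer, the columns follow the sorted categories (cat, cow, dog), so the column order differs from the one sketched in the question. The reshape step also addresses the likely cause of the [[ 1. 1. 1. 1. 1. 1.]] result: a 1D array of six values appears to have been read as a single sample with six features.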

Regarding python - Transforming text data in an sklearn pipeline, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31843008/