Problem description
I want to encode 3 categorical features out of the 10 features in my dataset. I use preprocessing from sklearn to do so, as follows:
from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
enc.fit(dataset.values)
However, I can't proceed because I get this error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: PG
I am surprised that it complains about the string, since the encoder is supposed to convert it. Am I missing something here?
Recommended answer
If you read the docs for OneHotEncoder, you'll see that the input for fit is "Input array of type int". So you need two steps to one-hot encode your data:
from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.LabelEncoder()
enc.fit(cat_features)
new_cat_features = enc.transform(cat_features)
print(new_cat_features)  # [1 2 0]
new_cat_features = new_cat_features.reshape(-1, 1)  # Needs to be the correct shape
ohe = preprocessing.OneHotEncoder(sparse=False)  # Easier to read
print(ohe.fit_transform(new_cat_features))
Output:
[[ 0. 1. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]]
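The snippet above encodes the feature names themselves. As a minimal sketch of the same two steps applied to the values of a single column, assuming a hypothetical 'color' column with made-up values (not taken from the asker's dataset), it could look like this:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical values for one categorical column (illustrative only)
colors = np.array(['Color', 'Black and White', 'Color', 'Color'])

# Step 1: map each distinct string to an integer code
le = LabelEncoder()
codes = le.fit_transform(colors)

# Step 2: one-hot encode the integer codes.
# OneHotEncoder expects a 2-D input, hence the reshape.
# The default output is a sparse matrix, so convert it for readability.
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()

print(codes)
print(one_hot)
```

Each row of one_hot has exactly one 1, in the position of that row's category. For several columns, the same two steps would be repeated with a separate LabelEncoder per column.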
Edit
As of 0.20 this became a bit easier, not only because OneHotEncoder now handles strings nicely, but also because we can transform multiple columns easily using ColumnTransformer. See below for an example:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
X = np.array([['apple', 'red', 1, 'round', 0],
['orange', 'orange', 2, 'round', 0.1],
['bannana', 'yellow', 2, 'long', 0],
['apple', 'green', 1, 'round', 0.2]])
ct = ColumnTransformer(
[('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3]),], # the column numbers I want to apply this to
remainder='passthrough' # This leaves the rest of my columns in place
)
print(ct.fit_transform(X))  # Notice the output is an array of strings
Output:
[['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0']
['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1']
['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0']
['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]
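The values come out as strings because a NumPy array built from mixed strings and numbers is stored with a single string dtype. A minimal sketch of the same idea on a pandas DataFrame, which keeps per-column dtypes, is shown here. The column names come from the question; the 'duration' column and all values are made-up assumptions, and sparse_threshold=0 is used only to force a dense result.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: three categorical columns from the question
# plus one assumed numeric column ('duration')
df = pd.DataFrame({
    'color': ['Color', 'Color', 'Black and White'],
    'director_name': ['A', 'B', 'A'],
    'actor_2_name': ['X', 'Y', 'Z'],
    'duration': [120, 95, 101],
})

cat_features = ['color', 'director_name', 'actor_2_name']

ct = ColumnTransformer(
    [('oh_enc', OneHotEncoder(), cat_features)],  # columns selected by name
    remainder='passthrough',  # keep the remaining columns unchanged
    sparse_threshold=0,       # always return a dense array
)

encoded = ct.fit_transform(df)
print(encoded.shape)
print(encoded)
```

With a DataFrame, the columns can be selected by name, and the passed-through numeric column stays numeric instead of being converted to strings.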