OneHotEncoder problem with categorical features

Problem description

I want to encode 3 categorical features out of the 10 features in my dataset. I use sklearn.preprocessing to do so, as follows:

from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
enc.fit(dataset.values)

However, I can't proceed because I get this error:

    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: PG

I am surprised that it complains about the string, since converting it is exactly what the encoder is supposed to do. Am I missing something here?

Recommended answer

If you read the docs for OneHotEncoder, you'll see that the input for fit is an "Input array of type int". So one-hot encoding string data takes two steps: first map the strings to integers with LabelEncoder, then one-hot encode those integers with OneHotEncoder.

from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.LabelEncoder()
enc.fit(cat_features)
new_cat_features = enc.transform(cat_features)
print(new_cat_features)  # [1 2 0]
new_cat_features = new_cat_features.reshape(-1, 1)  # OneHotEncoder expects a 2D array
# sparse=False gives a dense array that is easier to read
# (the argument is named sparse_output in scikit-learn 1.2 and later)
ohe = preprocessing.OneHotEncoder(sparse=False)
print(ohe.fit_transform(new_cat_features))

Output:

[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]
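
The snippet above encodes the feature names themselves. As a minimal sketch of the same two steps applied to the values of a single column, the following may help; the `colors` array is made-up illustration data, not taken from the question's dataset, and `.toarray()` is used instead of the `sparse` argument so the sketch does not depend on that argument's name.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical values of one categorical column (illustration only).
colors = np.array(["red", "green", "red", "blue"])

# Step 1: map each string to an integer code.
# LabelEncoder sorts the classes, so blue=0, green=1, red=2.
int_codes = LabelEncoder().fit_transform(colors)

# Step 2: one-hot encode the integer codes.
# OneHotEncoder expects a 2D array, hence the reshape;
# .toarray() converts the default sparse result to a dense array.
one_hot = OneHotEncoder().fit_transform(int_codes.reshape(-1, 1)).toarray()

print(int_codes)
print(one_hot)
```

Each row of `one_hot` has a single 1 in the column that corresponds to that row's integer code.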

Edit

As of scikit-learn 0.20 this became a bit easier, not only because OneHotEncoder now handles strings directly, but also because ColumnTransformer makes it easy to transform multiple columns. See the example below:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Mixed strings and numbers in one NumPy array are all stored as strings
X = np.array([['apple', 'red', 1, 'round', 0],
              ['orange', 'orange', 2, 'round', 0.1],
              ['banana', 'yellow', 2, 'long', 0],
              ['apple', 'green', 1, 'round', 0.2]])
ct = ColumnTransformer(
    # sparse=False is named sparse_output in scikit-learn 1.2 and later
    [('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3]),],  # the column numbers I want to apply this to
    remainder='passthrough'  # This leaves the rest of my columns in place
)
print(ct.fit_transform(X))  # Notice the output is a string array

Output:

[['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0']
 ['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1']
 ['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0']
 ['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]
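
To connect this back to the original question, here is a rough sketch of the same ColumnTransformer approach on a pandas DataFrame, selecting the categorical columns by name. The DataFrame contents and the `duration` column are invented for illustration; only the three categorical column names come from the question. Passing `sparse_threshold=0` asks ColumnTransformer to always return a dense array, which avoids relying on the encoder's `sparse` argument.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the question's dataset (values are made up).
dataset = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "director_name": ["A", "B", "A"],
    "actor_2_name": ["X", "Y", "Z"],
    "duration": [120, 95, 110],  # a numeric column that should pass through
})

cat_features = ["color", "director_name", "actor_2_name"]

ct = ColumnTransformer(
    [("oh_enc", OneHotEncoder(), cat_features)],  # select columns by name
    remainder="passthrough",  # keep the non-categorical columns
    sparse_threshold=0,       # always return a dense array
)

encoded = ct.fit_transform(dataset)
print(encoded.shape)
```

With this toy data the three categorical columns have 2, 2, and 3 distinct values, so the result has 7 one-hot columns followed by the passthrough `duration` column.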
