python - Is this one-hot encoding? | LabelEncoder

Reading:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

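It states: "Encode categorical integer features using a one-hot aka one-of-K scheme."

Does this also mean it one-hot encodes a list of words?

From the Wikipedia definition of one-hot encoding (https://en.wikipedia.org/wiki/One-hot):
"In natural language processing, a one-hot vector is a 1 × N matrix (vector) used to distinguish each word in a vocabulary from every other word in the vocabulary. The vector consists of 0s in all cells with the exception of a single 1 in a cell used uniquely to identify the word."

Running the code below, it appears that LabelEncoder is not a correct implementation of one-hot encoding, whereas OneHotEncoder is: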

from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import OneHotEncoder

# define example
data = ['w1 w2 w3', 'w1 w2']

# integer encode: each distinct string becomes one integer label
values = array(data)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)

# binary encode
# note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; older versions need sparse=False
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

# each string is iterated character by character here, giving 5 character classes
mlb = MultiLabelBinarizer()

print('fit_transform\n' , mlb.fit_transform(data))
print('\none hot\n' , onehot_encoder.fit_transform(integer_encoded))

This prints:

fit_transform
 [[1 1 1 1 1]
 [1 1 1 0 1]]

one hot
 [[0. 1.]
 [1. 0.]]

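So LabelEncoder does not one-hot encode. What type of encoding does LabelEncoder use?

From the output above, it appears that OneHotEncoder produces a denser vector than LabelEncoder's encoding scheme.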

Update:

How do you decide whether to use LabelEncoder or OneHotEncoder to encode data for a machine learning algorithm?

Best answer

I don't think your question is clear enough...

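First, LabelEncoder encodes labels with values between 0 and n_classes-1, while OneHotEncoder encodes categorical integer features using a one-hot (aka one-of-K) scheme. They are different.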
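To make the difference concrete, here is a minimal sketch that runs both encoders on the same small list of words. The `words` array is made up for illustration and is not the data from the question; `.toarray()` is used so the example does not depend on the `sparse`/`sparse_output` parameter name.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative data (an assumption, not taken from the question).
words = np.array(["w1", "w2", "w3", "w1"])

# LabelEncoder: one integer per class, in the range 0 .. n_classes-1.
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(words)
print(labels.tolist())  # [0, 1, 2, 0]

# OneHotEncoder: one column per category, exactly one 1 per row.
# It expects a 2-D array of shape (n_samples, n_features).
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(words.reshape(-1, 1)).toarray()
print(onehot)
```

LabelEncoder returns a 1-D array of integers, while OneHotEncoder returns a matrix with one row per sample and one column per distinct word.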

Second, yes, OneHotEncoder encodes a list of words. The Wikipedia definition says that a one-hot vector is a 1 × N matrix. But what is N? In fact, N is the size of your vocabulary.
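For the data in the question, a rough sketch of what N looks like is shown here. It assumes the intended vocabulary is the whitespace-separated tokens w1, w2, w3. Note that MultiLabelBinarizer produces one multi-hot row per sentence (a bag of words), which is not the same thing as a one-hot vector per word.

```python
from sklearn.preprocessing import MultiLabelBinarizer

data = ['w1 w2 w3', 'w1 w2']

# Passing raw strings: each string is iterated character by character,
# so the classes are characters, not words.
char_mlb = MultiLabelBinarizer()
char_matrix = char_mlb.fit_transform(data)
print(list(char_mlb.classes_))  # [' ', '1', '2', '3', 'w']

# Splitting on whitespace first gives a vocabulary of N = 3 words.
tokens = [sentence.split() for sentence in data]
word_mlb = MultiLabelBinarizer()
word_matrix = word_mlb.fit_transform(tokens)
print(list(word_mlb.classes_))  # ['w1', 'w2', 'w3']
print(word_matrix.tolist())     # [[1, 1, 1], [1, 1, 0]]
```

This also accounts for the five columns in the question's `fit_transform` output: they correspond to the five distinct characters, not to words.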

For example, suppose you have five words: a, b, c, d, e. One-hot encoding them gives:

a -> [1, 0, 0, 0, 0]  # a one-hot 1 x 5 vector
b -> [0, 1, 0, 0, 0]  # a one-hot 1 x 5 vector
c -> [0, 0, 1, 0, 0]  # a one-hot 1 x 5 vector
d -> [0, 0, 0, 1, 0]  # a one-hot 1 x 5 vector
e -> [0, 0, 0, 0, 1]  # a one-hot 1 x 5 vector
# total five one-hot 1 x 5 vectors which can be expressed in a 5 x 5 matrix.
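The mapping above can be reproduced with plain NumPy. This is only a sketch: the names `vocabulary` and `word_to_vector` are chosen here for illustration, and the rows of an identity matrix serve as the one-hot vectors.

```python
import numpy as np

vocabulary = ["a", "b", "c", "d", "e"]
n = len(vocabulary)  # N, the vocabulary size

# Row i of the N x N identity matrix is the one-hot vector for word i.
one_hot_matrix = np.eye(n, dtype=int)
word_to_vector = {word: one_hot_matrix[i] for i, word in enumerate(vocabulary)}

print(word_to_vector["c"].tolist())  # [0, 0, 1, 0, 0]
```

Stacking all five vectors gives back the 5 x 5 matrix mentioned above.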

Third, I'm actually not sure what you are asking...

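Finally, to answer your updated question: most of the time you should choose one-hot encoding or a word embedding. The reason is that the values produced by LabelEncoder are too similar to one another, meaning there is not much difference between them. Because similar inputs are more likely to lead to similar outputs, this makes the model hard to fit.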
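One possible way to read this claim, sketched below under the assumption of a five-word vocabulary: integer labels place the words on a number line, so some words end up closer together than others for no meaningful reason, whereas one-hot vectors keep every pair of distinct words at the same distance.

```python
import numpy as np

n_words = 5
label_codes = np.arange(n_words)  # the integers LabelEncoder would assign: 0..4
one_hot = np.eye(n_words)         # one one-hot row per word

# Distance from the first word to every word under each encoding.
label_dist = np.abs(label_codes - label_codes[0])
onehot_dist = np.linalg.norm(one_hot - one_hot[0], axis=1)

print(label_dist.tolist())  # [0, 1, 2, 3, 4]
print(onehot_dist)          # 0 for the word itself, sqrt(2) for every other word
```

With integer labels, a model sees word 1 as "nearer" to word 0 than word 4 is, which is an artifact of the arbitrary label order.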

Regarding "python - Is this one-hot encoding?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50579544/