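How can the result of OneHotEncoder encoding be saved? See: scikit learn - Save OneHot Encoder object python - Stack Overflow
Main text:
Data that was not seen during fitting raises an error at encoding time.
If you know how many categories there are, you can fix the encoding length by declaring them when the encoder is created.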
python - How to add your own categories into the OneHotEncoder - Stack Overflow
Example:
from sklearn.preprocessing import OneHotEncoder
a = [['1'], ['2'], ['3'], ['5']]
enc = OneHotEncoder()
X = enc.fit_transform(a)
enc.transform([['4']])
You can see that my training data does not contain '4', even though '4' is a possible label. So when I fit the encoder and then transform '4', it throws an error:
ValueError: Found unknown categories ['4'] in column 0 during transform
Solution:
There can be two cases here.
1. If you know all the categories beforehand.
Pass all the possible categories when the OneHotEncoder is initialized, as a list containing one list of categories per column.
enc = OneHotEncoder(categories=[[str(i) for i in range(10)]])
2. If you don't know some categories beforehand.
# This argument is set to `error` by default, so an error is raised if an unknown
# category is encountered.
enc = OneHotEncoder(handle_unknown='ignore')
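The two cases can be tried on the same toy data as the failing example. This is a minimal sketch, assuming a single string column; the variable names (`train`, `known`, `enc_known`, `enc_ignore`) are made up for illustration. Note that `categories` expects one list per input column, so a single column needs a nested list.

```python
from sklearn.preprocessing import OneHotEncoder

train = [['1'], ['2'], ['3'], ['5']]  # '4' never appears during fitting

# Case 1: declare every possible category up front.
# `categories` takes one list per input column, hence the nested list.
known = [[str(i) for i in range(10)]]
enc_known = OneHotEncoder(categories=known)
enc_known.fit(train)
row_known = enc_known.transform([['4']]).toarray()
# row_known has 10 columns, with the single 1 in the slot reserved for '4'

# Case 2: categories are not known in advance.
# Unseen values are encoded as a row of zeros instead of raising.
enc_ignore = OneHotEncoder(handle_unknown='ignore')
enc_ignore.fit(train)
row_ignore = enc_ignore.transform([['4']]).toarray()
# row_ignore has 4 columns (one per category seen during fitting), all 0
```

With case 1 the output width is fixed by the declared categories, while with case 2 it depends on what happened to be in the training data.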
Usage:
# One list of the 20 possible values (0..19) for each column of y_train_pred
n = y_train_pred.shape[1]
nvalues = [list(range(20)) for _ in range(n)]
nvalues
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
...]
Data to be encoded:
y_train_pred
array([[2, 2, 3, ..., 0, 5, 0],
[2, 2, 3, ..., 0, 5, 0],
[2, 2, 3, ..., 0, 5, 0],
...,
[2, 2, 3, ..., 0, 5, 0],
[2, 2, 3, ..., 0, 5, 0],
[1, 1, 1, ..., 1, 1, 1]], dtype=int32)
y_train_pred.shape # (465726, 89)
Encoding:
enc = OneHotEncoder(categories = nvalues)
enc.fit(y_train_pred)
enc.categories_
# Encode; .toarray() already returns a dense NumPy array
train_new_feature = enc.transform(y_train_pred).toarray()
train_new_feature
array([[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.]])
train_new_feature.shape[1] # 1780
The 89 original features are encoded into 1780 features (89 columns × 20 categories each).
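As for the question in the title, one common way to keep a fitted encoder for later use is to serialize it with Python's standard `pickle` module (joblib is another frequently used option). The sketch below is only an illustration under stated assumptions: the small random array `X` is a made-up stand-in for `y_train_pred`, and `n_cols`, `n_cats`, `blob` and `restored` are hypothetical names.

```python
import pickle

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for y_train_pred: 6 rows, 3 columns, values drawn from 0..4.
n_cols, n_cats = 3, 5
rng = np.random.default_rng(0)
X = rng.integers(0, n_cats, size=(6, n_cols))

# One category list per column, mirroring how nvalues is built.
categories = [list(range(n_cats)) for _ in range(n_cols)]
enc = OneHotEncoder(categories=categories).fit(X)

# Serialize the fitted encoder to bytes, then restore it.
# pickle.dump(enc, f) / pickle.load(f) do the same with an open binary file.
blob = pickle.dumps(enc)
restored = pickle.loads(blob)

# The restored encoder produces the same encoding as the original one.
encoded = restored.transform(X).toarray()
same = np.array_equal(encoded, enc.transform(X).toarray())
# encoded.shape is (6, n_cols * n_cats), the same columns-times-categories
# arithmetic as 89 * 20 = 1780
```

A pickled encoder is best loaded with the same scikit-learn version that created it, and pickle files from untrusted sources should not be loaded.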
References:
Introduction to the sklearn.preprocessing.OneHotEncoder() function, monster.YC's blog, CSDN
sklearn.preprocessing.OneHotEncoder — scikit-learn 1.2.0 documentation