sklearn.LabelEncoder with never seen before values

Problem description

If a sklearn.LabelEncoder has been fitted on a training set, it may fail when it encounters new values while being used on a test set.
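
To make the failure concrete, here is a minimal sketch, assuming a toy set of labels ("cat", "dog", "fish") that is not part of the original question. It fits an encoder on two labels and then asks it to transform a label it has never seen, which should raise a ValueError.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, chosen only for illustration
le = LabelEncoder()
le.fit(["cat", "dog"])

try:
    # "fish" was not present when the encoder was fitted
    le.transform(["cat", "fish"])
    raised = False
except ValueError as exc:
    raised = True
    print(f"transform failed: {exc}")
```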

The only solution I could come up with is to map everything new in the test set (i.e. every value that does not belong to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:

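import numpy as np
from sklearn.preprocessing import LabelEncoder

# train and test are pandas DataFrames and c is the name of the column to encode
le = LabelEncoder()
le.fit(train[c])
# Replace every test value that was not seen during fitting with a placeholder
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
# Register the placeholder as an extra class (appended at the end, so classes_ is no longer sorted)
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])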
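
The membership test inside the lambda scans le.classes_ once per row. As a rough sketch of the same mapping step in plain Python, the known classes can be put in a set first so that each lookup is a hash lookup. The helper name map_unseen and the sample values are hypothetical, introduced only for illustration.

```python
def map_unseen(values, known_classes, placeholder="<unknown>"):
    """Return a copy of values with every value outside known_classes replaced."""
    known = set(known_classes)  # set membership avoids rescanning the class list per value
    return [v if v in known else placeholder for v in values]


# Hypothetical sample data
known_classes = ["bird", "cat", "dog"]
test_values = ["dog", "fish", "cat", "lizard"]

mapped = map_unseen(test_values, known_classes)
print(mapped)  # ['dog', '<unknown>', 'cat', '<unknown>']
```

With the question's variables, the call would presumably be map_unseen(test[c], le.classes_), assuming the placeholder string never occurs as a real value in the column.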

This works, but is there a better solution?

Update

As @sapo_cosmico points out in a comment, the approach above no longer seems to work, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know whether that was the case before). Because np.searchsorted assumes its input is sorted, the <unknown> class cannot simply be appended to the LabelEncoder's list of already extracted classes; it needs to be inserted in sorted order:

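import bisect
import numpy as np

le_classes = le.classes_.tolist()
# Insert the placeholder at the position that keeps the class list sorted
bisect.insort_left(le_classes, '<unknown>')
# Store an array rather than a plain list, since classes_ is normally a NumPy array
le.classes_ = np.array(le_classes)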
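
To see why the position of the placeholder matters, the sketch below imitates a lookup based on binary search using only the standard library. It is not scikit-learn's actual code; the lookup helper and the sample classes are assumptions made for illustration. Appending "<unknown>" to the end breaks the sort order, so the binary search cannot find it, while a sorted insertion keeps every label findable.

```python
import bisect


def lookup(classes, label):
    """Binary-search lookup; only correct when classes is sorted."""
    i = bisect.bisect_left(classes, label)
    if i < len(classes) and classes[i] == label:
        return i
    return None  # label not found at the position binary search expects


# Hypothetical sorted classes, as a fitted encoder would hold them
classes = ["bird", "cat", "dog"]

# Appending puts "<unknown>" last, although it sorts before "bird"
appended = classes + ["<unknown>"]

# insort_left places it where the sort order requires
inserted = list(classes)
bisect.insort_left(inserted, "<unknown>")

print(lookup(appended, "<unknown>"))  # None
print(lookup(inserted, "<unknown>"))  # 0
print(lookup(inserted, "dog"))        # 3
```

Note that every existing class at or after the insertion point shifts by one, so the training column has to be transformed after the placeholder is inserted, not before.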

However, this all feels pretty clunky, and I'm certain there is a better approach.
