Problem description
If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.
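To make the failure concrete, here is a minimal sketch with made-up labels ("cat", "dog", "fish", "zebra" are illustrative only, not from my data). As far as I can tell, transform raises a ValueError for a label it did not see during fit; the exact message may differ between scikit-learn versions.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["cat", "dog", "fish"])  # toy training labels (illustrative only)

try:
    # "zebra" was not present when the encoder was fitted
    le.transform(["dog", "zebra"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```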
The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:
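import numpy as np
from sklearn.preprocessing import LabelEncoder

# train and test are pandas DataFrames and c is the name of the column to encode
le = LabelEncoder()
le.fit(train[c])
# Replace every test value that was not seen during fitting with a placeholder
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
# Register the placeholder as an additional class
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])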
This works, but is there a better solution?
Update
As @sapo_cosmico points out in a comment, it seems that the above doesn't work anymore, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know if it was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:
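import bisect
import numpy as np

le_classes = le.classes_.tolist()
# Insert the placeholder at its sorted position instead of appending it
bisect.insort_left(le_classes, '<unknown>')
# Store the result as an array again, since classes_ is normally a NumPy array
le.classes_ = np.array(le_classes)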
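Why the order matters can be seen with np.searchsorted alone. The sketch below uses toy class names ("a", "b", "c") that are purely illustrative; it assumes only that the lookup is a binary search, which expects its first argument to be sorted. With the placeholder appended at the end, the array is no longer sorted and the returned positions stop matching the real positions, without any error being raised.

```python
import numpy as np

# "<unknown>" sorts before lowercase letters, so this array is in sorted order
sorted_classes = np.array(["<unknown>", "a", "b", "c"])
# Same classes, but with the placeholder appended at the end (not sorted)
appended_classes = np.array(["a", "b", "c", "<unknown>"])

values = np.array(["<unknown>", "a", "c"])

# On the sorted array, the result is the true position of each value
good = np.searchsorted(sorted_classes, values)

# On the unsorted array, the binary search precondition is violated
bad = np.searchsorted(appended_classes, values)

# True positions of the values inside appended_classes
expected_positions = np.array([3, 0, 2])
mismatch = bad != expected_positions

print(good)            # positions in the sorted array
print(mismatch.any())  # whether any lookup landed on the wrong class
```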
However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.
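For reference, the workaround described above can be wrapped in two small helpers. This is only a sketch: fit_with_unknown and transform_with_unknown are hypothetical names of my own, not part of scikit-learn, and it assumes that the placeholder "<unknown>" never occurs as a real value and that all labels are strings.

```python
import bisect

import numpy as np
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "<unknown>"  # placeholder; assumed not to occur in the real data


def fit_with_unknown(train_values):
    """Fit a LabelEncoder and add the placeholder class in sorted position."""
    le = LabelEncoder()
    le.fit(train_values)
    classes = le.classes_.tolist()
    bisect.insort_left(classes, UNKNOWN)
    le.classes_ = np.array(classes)
    return le


def transform_with_unknown(le, values):
    """Map unseen values to the placeholder, then encode as usual."""
    known = set(le.classes_)
    cleaned = [v if v in known else UNKNOWN for v in values]
    return le.transform(cleaned)


le = fit_with_unknown(["cat", "dog", "fish"])  # toy labels (illustrative only)
codes = transform_with_unknown(le, ["dog", "zebra", "cat"])
print(le.classes_)
print(codes)
```

Here "zebra" ends up with the same code as the placeholder, so every unseen value shares a single integer code.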
Recommended answer
I ended up switching to Pandas' get_dummies due to this problem of unseen data.
- Create the dummies on the training data
dummy_train = pd.get_dummies(train)
- Create the dummies on the new (unseen) data
dummy_new = pd.get_dummies(new_data)
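- Reindex the new data to the columns of the training data, filling the missing values with 0 (reindex returns a new DataFrame, so assign the result)
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)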
Effectively, any categorical value that appears only in the new data is dropped by the reindex and will not go into the classifier, but I think that should not cause problems, as the classifier would not know what to do with it anyway.
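The three steps above can be tried end to end on a toy example. The column name "color" and its values are made up for illustration; dtype=int is passed only so that the dummy columns hold 0/1 consistently with the fill value (without it, recent pandas versions produce boolean dummy columns).

```python
import pandas as pd

# Toy data (illustrative only): "blue" never appears in the training set
train = pd.DataFrame({"color": ["red", "green", "red"]})
new_data = pd.DataFrame({"color": ["green", "blue"]})

# One indicator column per category seen in each frame
dummy_train = pd.get_dummies(train, dtype=int)
dummy_new = pd.get_dummies(new_data, dtype=int)

# Align the new data to the training columns:
# columns missing from new_data are added and filled with 0,
# columns not present in the training data (color_blue) are dropped
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

print(dummy_new)
```

The row for "blue" comes out as all zeros, which is what "will not go into the classifier" means in practice: the unseen category is indistinguishable from "none of the known categories".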