Handling previously unseen values with sklearn.LabelEncoder

Problem description

If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.
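To make the failure concrete, here is a minimal sketch with hypothetical labels ("cat", "dog", "fish", "bird" are made up for illustration). In the scikit-learn versions I am aware of, transform raises a ValueError when it meets a label that was not present during fit:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["cat", "dog", "fish"])  # hypothetical training labels

try:
    le.transform(["cat", "bird"])  # "bird" was never seen during fit
    failed = False
except ValueError as exc:
    failed = True
    print(f"transform failed: {exc}")
```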

The only solution I could come up with for this is to map everything new in the test set (i.e. not belonging to any existing class) to "<unknown>", and then explicitly add a corresponding class to the LabelEncoder afterward:

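import numpy as np
from sklearn.preprocessing import LabelEncoder

# train and test are pandas DataFrames and c is the name of a categorical column
le = LabelEncoder()
le.fit(train[c])
# Replace every test value that was not seen during fit with a placeholder
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
# Register the placeholder as an additional class
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])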

This works, but is there a better solution?

Update

As @sapo_cosmico points out in a comment, it seems that the above doesn't work anymore, given what I assume is an implementation change in LabelEncoder.transform, which now seems to use np.searchsorted (I don't know if it was the case before). So instead of appending the <unknown> class to the LabelEncoder's list of already extracted classes, it needs to be inserted in sorted order:

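import bisect
import numpy as np

le_classes = le.classes_.tolist()
# Insert the placeholder at its sorted position instead of appending it
bisect.insort_left(le_classes, '<unknown>')
# Convert back to an array, the type LabelEncoder normally stores in classes_
le.classes_ = np.array(le_classes)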
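Putting the pieces of the update together, a self-contained sketch might look like the following. The helper names fit_with_unknown and transform_with_unknown, the constant UNKNOWN, and the sample colors are hypothetical and only serve the illustration; overwriting classes_ by hand is the workaround described above, not something the library documents as a supported feature.

```python
import bisect

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "<unknown>"  # placeholder label, as used in the question


def fit_with_unknown(train_col):
    """Fit a LabelEncoder and register UNKNOWN at its sorted position."""
    le = LabelEncoder()
    le.fit(train_col)
    classes = le.classes_.tolist()
    if UNKNOWN not in classes:
        bisect.insort_left(classes, UNKNOWN)
    le.classes_ = np.array(classes)
    return le


def transform_with_unknown(le, col):
    """Map values the encoder does not know to UNKNOWN, then encode."""
    known = set(le.classes_)
    cleaned = col.map(lambda s: s if s in known else UNKNOWN)
    return le.transform(cleaned)


# Hypothetical data: "purple" appears only in the test set
train = pd.DataFrame({"c": ["red", "green", "blue"]})
test = pd.DataFrame({"c": ["green", "purple"]})

le = fit_with_unknown(train["c"])
train_codes = transform_with_unknown(le, train["c"])
test_codes = transform_with_unknown(le, test["c"])
```

With this data, "purple" ends up with the same integer code as "<unknown>", so every unseen value collapses into a single extra class.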

However, as this feels pretty clunky all in all, I'm certain there is a better approach for this.

Recommended answer

I ended up switching to Pandas' get_dummies due to this problem of unseen data.

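  • Create dummies on the training data
    dummy_train = pd.get_dummies(train)
  • Create dummies on the new (unseen) data
    dummy_new = pd.get_dummies(new_data)
  • Reindex the new data to the columns of the training data, filling missing values with 0
    dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)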

Effectively any new features which are categorical will not go into the classifier, but I think that should not cause problems as it would not know what to do with them.
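As a rough end-to-end sketch of these steps, assuming two small hypothetical frames (the column names "color" and "size" and their values are invented for the example):

```python
import pandas as pd

# Hypothetical data: "purple" only occurs in the new data, "red" only in training
train = pd.DataFrame({"color": ["red", "green"], "size": [1, 2]})
new_data = pd.DataFrame({"color": ["green", "purple"], "size": [3, 4]})

dummy_train = pd.get_dummies(train)
dummy_new = pd.get_dummies(new_data)

# Align the new data to the training columns: columns missing from the new
# data are added and filled with 0, columns unknown to training are dropped
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
```

After the reindex, dummy_new has exactly the columns of dummy_train, so the column for the unseen category "purple" is gone and the column for "red" is all zeros.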
