Problem description
I am trying to use scikit-learn 0.12.1 to:
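- train a LogisticRegression classifier
- evaluate the classifier on held-out validation data
- feed new data to this classifier and retrieve the 5 most probable labels for each observation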
Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels, and some of them have not occurred in the training data available.
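This results in 2 problems:
- The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels, but it exacerbates problem 2.
- The output of the predict_proba method of the LogisticRegression classifier is a [n_samples, n_classes] array, where n_classes consists only of the classes seen in the training data. This means that running argsort on the predict_proba array no longer provides values that map directly to the label vectorizer's vocabulary.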
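To make problem 2 concrete, here is a minimal sketch using NumPy only. The arrays are hypothetical stand-ins, not output from a real classifier: trained_classes plays the role of clf.classes_, proba plays the role of clf.predict_proba(X), and the four animal labels are made up for illustration.

```python
import numpy as np

# Hypothetical full label vocabulary, as a label encoder fitted on
# every possible label would hold it.
all_classes = np.array(["bird", "cat", "dog", "fish"])

# Stand-in for clf.classes_: "cat" never appeared in the training data.
trained_classes = np.array(["bird", "dog", "fish"])

# Stand-in for clf.predict_proba(X): one column per entry of trained_classes.
proba = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.2, 0.3]])

# Column index of the most probable class in each row.
best = np.argsort(proba, axis=1)[:, -1]

# Indexing the full vocabulary with those positions can give the wrong
# label (here "cat" for the first row, a class the model never saw)...
wrong = all_classes[best]

# ...while indexing the trained classes gives the intended labels.
right = trained_classes[best]

print(wrong.tolist(), right.tolist())
```

The second row happens to come out right either way, which is part of what makes this kind of mismatch easy to miss.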
My question is, what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but probabilities of 0 for those labels are perfectly usable in my situation.
Recommended answer
Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,
from itertools import repeat

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(all_classes).difference(clf.classes_)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
    # pair each trained class with its probability in this row,
    # then append the untrained classes with probability 0.
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
This produces a list of (cls, prob) pairs for each row.
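If an array is more convenient than a list of pairs, the same idea can be sketched with NumPy: build a zero-filled [n_samples, n_all_classes] array whose columns follow all_classes, copy the predicted probabilities into the columns of the trained classes, and then take the top k labels per row. The helper names expand_proba and top_k_labels are hypothetical, and the sample arrays stand in for clf.classes_ and clf.predict_proba(test_samples). This assumes every trained class appears in all_classes.

```python
import numpy as np

def expand_proba(proba, trained_classes, all_classes):
    """Return an array with one column per entry of all_classes.

    Columns for classes missing from trained_classes are left at 0.
    """
    labels = np.asarray(all_classes).tolist()
    column_of = {label: j for j, label in enumerate(labels)}
    full = np.zeros((proba.shape[0], len(labels)))
    cols = [column_of[label] for label in np.asarray(trained_classes).tolist()]
    full[:, cols] = proba
    return full

def top_k_labels(full_proba, all_classes, k=5):
    """Return the k most probable labels per row, most probable first."""
    order = np.argsort(-full_proba, axis=1, kind="stable")[:, :k]
    return np.asarray(all_classes)[order]

# Hypothetical data: "cat" was never seen during training.
all_classes = ["bird", "cat", "dog", "fish"]
trained_classes = ["bird", "dog", "fish"]   # stand-in for clf.classes_
proba = np.array([[0.1, 0.7, 0.2],          # stand-in for clf.predict_proba(...)
                  [0.5, 0.2, 0.3]])

full = expand_proba(proba, trained_classes, all_classes)
top2 = top_k_labels(full, all_classes, k=2)
print(full)
print(top2)
```

With the columns aligned to all_classes, the indices returned by argsort map directly onto the full label vocabulary, which is what the question asks for. Unseen classes always rank last among labels with nonzero probability.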