Problem description
I am trying to build a multi-label out-of-core text classifier. As described here, the idea is to read (large-scale) text data sets in batches and partially fit them to the classifiers. Additionally, when you have multi-label instances as described here, the idea is to build as many binary classifiers as there are classes in the data set, in a one-vs-all manner.
When combining the MultiLabelBinarizer and OneVsRestClassifier classes of sklearn with partial fitting, I get a ValueError with the following code:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
categories = ['a', 'b', 'c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'],['b'],['a','b']]
mlb = MultiLabelBinarizer(classes=categories)
# alternate_sign=False replaces non_negative=True, which was removed in scikit-learn 0.21
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18, alternate_sign=False)
clf = OneVsRestClassifier(MultinomialNB(alpha=0.01))
X_train = vectorizer.fit_transform(X)
Y_train = mlb.fit_transform(Y)
clf.partial_fit(X_train, Y_train, classes=categories)
You can imagine that the last three lines are applied to each mini-batch; I have removed the batching code for the sake of simplicity.
If you remove the OneVsRestClassifier and use MultinomialNB only, the code runs fine.
Recommended answer
You are passing Y_train as transformed by MultiLabelBinarizer, which has the form [[1, 1, 0], [0, 1, 0], [1, 1, 0]], but you are passing classes as ['a', 'b', 'c']. Both then go through this line of the code:
if np.setdiff1d(y, self.classes_):
    raise ValueError(("Mini-batch contains {0} while classes " +
                      "must be subset of {1}").format(np.unique(y),
                                                      self.classes_))
np.setdiff1d returns an array of the values in y that are not found in self.classes_, and here that array has more than one element. An if statement cannot reduce such an array to a single truth value, hence the error.
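The failure mode can be reproduced with NumPy alone. The following is a minimal sketch rather than the scikit-learn code itself; the integer `classes` array is a hypothetical stand-in, chosen only so that none of the 0/1 values in `y` match it.

```python
import numpy as np

# Binarized targets, as produced by MultiLabelBinarizer in the question.
y = np.array([[1, 1, 0], [0, 1, 0], [1, 1, 0]])

# Hypothetical class labels that share no value with y (assumption for illustration).
classes = np.array([2, 3, 4])

# Values of y that are missing from classes: more than one element.
diff = np.setdiff1d(y, classes)

message = None
try:
    if diff:  # truth value of a multi-element array is ambiguous
        pass
except ValueError as exc:
    message = str(exc)

print(diff, message)

# An explicit length check avoids the ambiguity.
has_unknown_values = len(diff) > 0
```

Testing `len(diff) > 0` (or `diff.size`) gives a single boolean regardless of how many elements the array holds.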
The first thing is that you should pass classes in the same numerical format as Y_train. Even if you do that, the internal label_binarizer_ of OneVsRestClassifier will decide that the target is of type "multiclass" rather than "multilabel" and will then refuse to transform the classes correctly. In my opinion this is a bug in OneVsRestClassifier and/or LabelBinarizer.
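To see how scikit-learn classifies the two inputs, the public helper `type_of_target` from `sklearn.utils.multiclass` can be called on each. This is only an illustrative sketch: the numeric class list `[0, 1, 2]` is a hypothetical stand-in for "classes in numerical format", and that `label_binarizer_` reaches its decision through exactly this helper is an assumption.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# The binarized targets from the question: a 2-D indicator matrix.
Y_train = np.array([[1, 1, 0], [0, 1, 0], [1, 1, 0]])

# Hypothetical numeric class list (assumption for illustration): a flat 1-D vector.
numeric_classes = np.array([0, 1, 2])

target_type_y = type_of_target(Y_train)
target_type_classes = type_of_target(numeric_classes)

print(target_type_y)        # multilabel-indicator
print(target_type_classes)  # multiclass
```

A flat list of classes looks like a multiclass target, while the mini-batch itself looks like a multilabel indicator matrix, which matches the mismatch described above.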
Please submit an issue about partial_fit on the scikit-learn GitHub repository and see what happens.
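In the meantime, one possible workaround is to do the one-vs-all bookkeeping by hand: keep one MultinomialNB per label and call partial_fit on the matching column of the binarized targets. This is a sketch under stated assumptions, not the library's own fix; the helper names `partial_fit_batch` and `predict_labels` are hypothetical, and `alternate_sign=False` is used in place of `non_negative=True`, which newer scikit-learn releases no longer accept.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

categories = ['a', 'b', 'c']
mlb = MultiLabelBinarizer(classes=categories)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18,
                               alternate_sign=False)

# One independent binary classifier per label.
estimators = [MultinomialNB(alpha=0.01) for _ in categories]


def partial_fit_batch(texts, labels):
    """Update every per-label classifier with one mini-batch (hypothetical helper)."""
    X_batch = vectorizer.transform(texts)
    Y_batch = mlb.fit_transform(labels)  # shape (n_samples, n_labels), values 0/1
    for column, estimator in enumerate(estimators):
        # Each binary problem always has the two classes 0 and 1.
        estimator.partial_fit(X_batch, Y_batch[:, column], classes=[0, 1])


def predict_labels(texts):
    """Return a 0/1 indicator matrix with one column per label (hypothetical helper)."""
    X_batch = vectorizer.transform(texts)
    return np.column_stack([est.predict(X_batch) for est in estimators])


X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'], ['b'], ['a', 'b']]

partial_fit_batch(X, Y)  # call once per mini-batch in a real out-of-core loop
predictions = predict_labels(["This is a test"])
print(predictions)
```

The indicator matrix can be mapped back to label names with `mlb.inverse_transform(predictions)`.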
Update: Apparently, deciding between "multilabel" and "multiclass" from the target vector (y) is a currently ongoing issue in scikit-learn because of all the complications surrounding it.
- https://github.com/scikit-learn/scikit-learn/issues/7665
- https://github.com/scikit-learn/scikit-learn/issues/5959
- https://github.com/scikit-learn/scikit-learn/issues/7931
- https://github.com/scikit-learn/scikit-learn/issues/8098
- https://github.com/scikit-learn/scikit-learn/issues/7628
- https://github.com/scikit-learn/scikit-learn/pull/2626