Problem description
I use SciPy and scikit-learn to train and apply a Multinomial Naive Bayes classifier for binary text classification. Specifically, I use the module sklearn.feature_extraction.text.CountVectorizer for creating sparse matrices that hold word feature counts from text, and the module sklearn.naive_bayes.MultinomialNB as the classifier implementation for training the classifier on training data and applying it to test data.
The input to the CountVectorizer is a list of text documents represented as Unicode strings. The training data is much larger than the test data. My code looks like this (simplified):
vectorizer = CountVectorizer(**kwargs)
# sparse matrix with training data
X_train = vectorizer.fit_transform(list_of_documents_for_training)
# vector holding target values (=classes, either -1 or 1) for training documents
# this vector has the same number of elements as the list of documents
y_train = numpy.array([1, 1, 1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, ...])
# sparse matrix with test data
X_test = vectorizer.fit_transform(list_of_documents_for_testing)
# Training stage of NB classifier
classifier = MultinomialNB()
classifier.fit(X=X_train, y=y_train)
# Prediction of log probabilities on test data
X_log_proba = classifier.predict_log_proba(X_test)
Problem: As soon as MultinomialNB.predict_log_proba() is called, I get ValueError: dimension mismatch. According to the IPython stack trace below, the error occurs in SciPy:
/path/to/my/code.pyc
--> 177 X_log_proba = classifier.predict_log_proba(X_test)
/.../sklearn/naive_bayes.pyc in predict_log_proba(self, X)
76 in the model, where classes are ordered arithmetically.
77 """
--> 78 jll = self._joint_log_likelihood(X)
79 # normalize by P(x) = P(f_1, ..., f_n)
80 log_prob_x = logsumexp(jll, axis=1)
/.../sklearn/naive_bayes.pyc in _joint_log_likelihood(self, X)
345 """Calculate the posterior log probability of the samples X"""
346 X = atleast2d_or_csr(X)
--> 347 return (safe_sparse_dot(X, self.feature_log_prob_.T)
348 + self.class_log_prior_)
349
/.../sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
71 from scipy import sparse
72 if sparse.issparse(a) or sparse.issparse(b):
--> 73 ret = a * b
74 if dense_output and hasattr(ret, "toarray"):
75 ret = ret.toarray()
/.../scipy/sparse/base.pyc in __mul__(self, other)
276
277 if other.shape[0] != self.shape[1]:
--> 278 raise ValueError('dimension mismatch')
279
280 result = self._mul_multivector(np.asarray(other))
I have no idea why this error occurs. Can anybody please explain it to me and provide a solution for this problem? Thanks a lot in advance!
Recommended answer
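Sounds to me like you just need to use vectorizer.transform for the test dataset, since the training dataset fixes the vocabulary (you cannot know the full vocabulary, including words that only appear outside the training set, ahead of time after all). Calling fit_transform on the test documents re-learns the vocabulary from them, so X_test ends up with a different number of columns than the matrix the classifier was trained on, and the sparse product inside predict_log_proba fails with the dimension mismatch. Just to be clear, that's vectorizer.transform instead of vectorizer.fit_transform.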
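To make the difference concrete, here is a minimal sketch of the corrected flow. The toy documents, labels, and the names train_docs, test_docs and X_test_refit are made up for illustration and are not from the question. It assumes a reasonably recent scikit-learn with default CountVectorizer settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real document lists.
train_docs = [
    "good great film",
    "great acting good plot",
    "bad boring film",
    "boring plot bad acting",
]
y_train = np.array([1, 1, -1, -1])
test_docs = ["good film with unseen words", "boring bad plot"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training documents only.
X_train = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary, so the columns line up with X_train.
# Words not seen during training are simply ignored.
X_test = vectorizer.transform(test_docs)

# For comparison: refitting on the test documents builds a different
# vocabulary, which is what leads to the dimension mismatch.
X_test_refit = CountVectorizer().fit_transform(test_docs)

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
log_proba = classifier.predict_log_proba(X_test)

print(X_train.shape, X_test.shape, X_test_refit.shape)
print(log_proba)
```

X_train and X_test share the same number of columns here, while X_test_refit does not, which is why only the transform version can be passed to the trained classifier.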