Problem description
I instantiated a sklearn.feature_extraction.text.CountVectorizer object and passed it a vocabulary through the vocabulary argument, but I get the error message sklearn.utils.validation.NotFittedError: CountVectorizer - Vocabulary wasn't fitted. Why?
Example:
import sklearn.feature_extraction
import numpy as np
import pickle
# Save the vocabulary
ngram_size = 1
dictionary_filepath = 'my_unigram_dictionary'
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size), min_df=1)
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document? This is right.',]
vect = vectorizer.fit(corpus)
print('vect.get_feature_names(): {0}'.format(vect.get_feature_names()))
# pickle needs the file opened in binary mode
with open(dictionary_filepath, 'wb') as f:
    pickle.dump(vect.vocabulary_, f)
# Load the vocabulary
with open(dictionary_filepath, 'rb') as f:
    vocabulary_to_load = pickle.load(f)
loaded_vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size), min_df=1, vocabulary=vocabulary_to_load)
print('loaded_vectorizer.get_feature_names(): {0}'.format(loaded_vectorizer.get_feature_names()))
Output:
vect.get_feature_names(): [u'and', u'document', u'first', u'is', u'one', u'right', u'second', u'the', u'third', u'this']
Traceback (most recent call last):
  File "C:\Users\Francky\Documents\GitHub\adobe\dstc4\test\CountVectorizerSaveDic.py", line 22, in <module>
    print('loaded_vectorizer.get_feature_names(): {0}'.format(loaded_vectorizer.get_feature_names()))
  File "C:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 890, in get_feature_names
    self._check_vocabulary()
  File "C:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 271, in _check_vocabulary
    check_is_fitted(self, 'vocabulary_', msg=msg),
  File "C:\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 627, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.utils.validation.NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
Recommended answer
Even though you passed vocabulary=vocabulary_to_load as an argument to sklearn.feature_extraction.text.CountVectorizer(), you still need to call loaded_vectorizer._validate_vocabulary() before you can call loaded_vectorizer.get_feature_names(). The traceback shows why: get_feature_names() checks for the fitted attribute vocabulary_ (with a trailing underscore), and the constructor alone does not set it.
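To make the mechanism concrete, here is a toy model of the pattern the traceback suggests. It is only a sketch: ToyVectorizer and its NotFittedError are hypothetical stand-ins written for illustration, not scikit-learn's actual code, and the three-word vocabulary is made up.

```python
class NotFittedError(Exception):
    """Toy stand-in for sklearn.utils.validation.NotFittedError."""


class ToyVectorizer:
    """Simplified illustration only; not scikit-learn's implementation."""

    def __init__(self, vocabulary=None):
        # The constructor only records the parameter as given.
        self.vocabulary = vocabulary

    def _validate_vocabulary(self):
        # Copies the user-supplied vocabulary into the fitted attribute.
        if self.vocabulary is not None:
            self.vocabulary_ = dict(self.vocabulary)

    def get_feature_names(self):
        # Only the trailing-underscore attribute counts as "fitted".
        if not hasattr(self, "vocabulary_"):
            raise NotFittedError("ToyVectorizer - Vocabulary wasn't fitted.")
        # Terms ordered by their column index.
        return sorted(self.vocabulary_, key=self.vocabulary_.get)


toy = ToyVectorizer(vocabulary={"and": 0, "document": 1, "first": 2})

try:
    toy.get_feature_names()
    failed_before_validation = False
except NotFittedError:
    failed_before_validation = True

toy._validate_vocabulary()
names_after_validation = toy.get_feature_names()
print(failed_before_validation, names_after_validation)
```

In this toy version the first call fails and the second succeeds, which is the same before/after behaviour described for loaded_vectorizer.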
In your example, you should therefore do the following when creating a CountVectorizer object with your vocabulary:
with open(dictionary_filepath, 'rb') as f:
    vocabulary_to_load = pickle.load(f)
loaded_vectorizer = sklearn.feature_extraction.text.CountVectorizer(
    ngram_range=(ngram_size, ngram_size), min_df=1, vocabulary=vocabulary_to_load)
# _validate_vocabulary() is a private method; it sets vocabulary_ from the vocabulary argument
loaded_vectorizer._validate_vocabulary()
print('loaded_vectorizer.get_feature_names(): {0}'.format(loaded_vectorizer.get_feature_names()))
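If all you need is the list of feature names, one way to avoid calling a private method is to read them straight from the saved dictionary, since the vocabulary maps each term to its column index. The sketch below uses only the standard library. The vocabulary literal is an assumption reconstructed from the output shown in the question, and the temporary directory is a hypothetical location chosen for the demonstration.

```python
import os
import pickle
import tempfile

# Assumed contents of vect.vocabulary_ for the example corpus:
# each term maps to its column index (alphabetical order here).
vocabulary = {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4,
              'right': 5, 'second': 6, 'the': 7, 'third': 8, 'this': 9}

# Hypothetical temporary location, used only for this demonstration.
dictionary_filepath = os.path.join(tempfile.mkdtemp(), 'my_unigram_dictionary')

# Save and reload the vocabulary; pickle files are opened in binary mode.
with open(dictionary_filepath, 'wb') as f:
    pickle.dump(vocabulary, f)
with open(dictionary_filepath, 'rb') as f:
    loaded_vocabulary = pickle.load(f)

# Feature names are the terms sorted by their column index.
feature_names = sorted(loaded_vocabulary, key=loaded_vocabulary.get)
print(feature_names)
```

This only recovers the names. To vectorize new documents you still need a CountVectorizer built with vocabulary=loaded_vocabulary.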