I have saved my classifier pipeline using joblib:
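import joblib  # older scikit-learn versions used: from sklearn.externals import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
pac_clf = PassiveAggressiveClassifier(C=1)
vec_clf = Pipeline([('vectorizer', vec), ('pac', pac_clf)])
vec_clf.fit(X_train, y_train)
joblib.dump(vec_clf, 'class.pkl', compress=9)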
Now I am trying to use it in a production environment:
def classify(title):
    # load classifier and predict
    classifier = joblib.load('class.pkl')
    # vectorize/transform the new title then predict
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
    X_test = vectorizer.transform(title)
    predict = classifier.predict(X_test)
    return predict
The error I get is: ValueError: Vocabulary wasn't fitted or is empty!
I guess I should load the vocabulary from joblib, but I can't get it to work.
Best answer
Just replace:
    # load classifier and predict
    classifier = joblib.load('class.pkl')
    # vectorize/transform the new title then predict
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
    X_test = vectorizer.transform(title)
    predict = classifier.predict(X_test)
    return predict
with:
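    # load the saved pipeline that includes both the vectorizer
    # and the classifier and predict
    classifier = joblib.load('class.pkl')
    # pass the raw text directly; the pipeline vectorizes it internally
    # (title must be an iterable of strings, e.g. [title] for a single string)
    predict = classifier.predict(title)
    return predict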
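class.pkl contains the full pipeline, so there is no need to create a new vectorizer instance. As the error message states, you need to reuse the vectorizer that was trained originally, because the feature mapping from tokens (string n-grams) to column indices is stored in the vectorizer itself. This mapping is called the "vocabulary".
Regarding python - bringing a classifier to production, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25788151/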
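To make the point about the vocabulary concrete, here is a minimal sketch of the save/load round trip. The toy training data, the temporary file location, and the `model_path` parameter are assumptions made only for illustration; they are not from the original question. It shows that the reloaded pipeline still carries the fitted `vocabulary_`, while a freshly constructed `TfidfVectorizer` has none and refuses to transform.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data, used only so the sketch is runnable.
X_train = ["cheap flights to paris", "football match tonight",
           "hotel deals in rome", "basketball final score"]
y_train = ["travel", "sport", "travel", "sport"]

vec_clf = Pipeline([
    ("vectorizer", TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                   ngram_range=(1, 3))),
    ("pac", PassiveAggressiveClassifier(C=1)),
])
vec_clf.fit(X_train, y_train)

# Assumed location: a temp directory instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "class.pkl")
joblib.dump(vec_clf, path, compress=9)


def classify(titles, model_path=path):
    """Predict labels for an iterable of raw title strings."""
    classifier = joblib.load(model_path)
    # The pipeline applies its own fitted vectorizer before predicting.
    return classifier.predict(titles)


# The reloaded pipeline still holds the token -> column index mapping.
loaded = joblib.load(path)
fitted_vocab = loaded.named_steps["vectorizer"].vocabulary_

# A brand-new vectorizer has no vocabulary, so transform() fails.
fresh = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
try:
    fresh.transform(["football tonight"])
    fresh_error = None
except ValueError as exc:  # NotFittedError is a ValueError subclass
    fresh_error = exc

predictions = classify(["football tonight"])
print(type(fresh_error).__name__ if fresh_error else "no error")
print(predictions)
```

In a long-running service it would probably make sense to call `joblib.load` once at startup rather than on every `classify` call, since deserializing the pipeline on each request is wasted work.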