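Hi, I am trying to classify text into 4 categories. I want to print two things: the predicted category, and the probability that the text belongs to each class.
After reading the scikit-learn documentation, I think I should use predict_proba.
So far, my code looks like this: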

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.datasets import load_files
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

string = sys.argv[1]  # text to classify, passed on the command line
sets = load_files('scikit')  # load the training set




count_vect = CountVectorizer(analyzer='char_wb', ngram_range=(0, 3), min_df=1)
X_train_counts = count_vect.fit_transform(sets.data)


tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)


tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)



clf = MultinomialNB().fit(X_train_tfidf, sets.target)
docs_new = [string]
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, sets.target_names[category]))  # prints the prediction, and it is correct
    print(clf.predict_proba(sets.target_names))  # trying to get the probabilities for all classes


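Unfortunately, the output is: ValueError: objects are not aligned. I have tried many different approaches and searched the web extensively, but nothing seems to work.
Any suggestions would be greatly appreciated. Thanks,
Nico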

Best answer

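The input to predict_proba() should be exactly the same as the input you pass to predict(). In your code, predict_proba() receives sets.target_names, which is a list of class-name strings rather than the vectorized documents, so it does not line up with the features the classifier was trained on. Pass the same matrix you gave to predict() instead: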

clf.predict_proba(X_new_tfidf)
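
To show how the probabilities can be paired with class names, here is a minimal, self-contained sketch. It makes several assumptions that are not in the question: the tiny in-memory `train_docs`, `train_targets`, and `target_names` are hypothetical stand-ins for `load_files('scikit')`, the three steps are wrapped in a `Pipeline` for brevity, and `ngram_range=(1, 3)` is used rather than the question's `(0, 3)`. Treat it as an illustration rather than a drop-in replacement.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for load_files('scikit').
train_docs = [
    "the striker scored a late goal",
    "the goalkeeper saved the penalty",
    "the new phone has a faster chip",
    "the laptop ships with more memory",
]
train_targets = [0, 0, 1, 1]
target_names = ["sport", "tech"]

# Vectorizer, TF-IDF transform, and classifier chained together, so the same
# preprocessing is applied to training data and to new documents.
text_clf = Pipeline([
    ("counts", CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
text_clf.fit(train_docs, train_targets)

docs_new = ["the goalkeeper scored a goal"]
predicted = text_clf.predict(docs_new)
# Same input as predict(); the result has shape (n_docs, n_classes).
probabilities = text_clf.predict_proba(docs_new)

for doc, category, row in zip(docs_new, predicted, probabilities):
    print("%r => %s" % (doc, target_names[category]))
    # The columns of predict_proba follow the order of classes_.
    for class_index, prob in zip(text_clf.classes_, row):
        print("  %s: %.3f" % (target_names[class_index], prob))
```

Each row of the result sums to 1, and the column order matches `classes_`, which is why the loop zips the two together instead of assuming a fixed order.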
