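I'm trying to build a text classification model on a database of site reviews (3 classes).
I cleaned the DataFrame, tokenized it (with CountVectorizer), applied TF-IDF (TfidfTransformer), and built an MNB model.
Now that I've trained and evaluated the model, I want a list of the wrong predictions so I can pass them through LIME and explore the words that confuse the model.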

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve,
)

df = pd.read_csv(
    "https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
    labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)

x = cleaned_df["review_text"]
y = cleaned_df["business_category"]

# tokenization
vectorizer = CountVectorizer()
vectorizer_fit = vectorizer.fit(x)
bow_x = vectorizer_fit.transform(x)

# transform BOW to TF-IDF
transformer = TfidfTransformer()
transformer_x = transformer.fit(bow_x)
tfidf_x = transformer_x.transform(bow_x)

# SPLITTING THE DATASET INTO TRAINING SET AND TESTING SET
x_train, x_test, y_train, y_test = train_test_split(
    tfidf_x, y, test_size=0.3, random_state=101
)

mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train, y_train)

predmnb = mnb.predict(x_test)


My goal is to get the original indices of the reviews that the model predicted incorrectly.
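One way to read the question's own setup: `y` is a pandas Series, and `train_test_split` keeps the row labels of pandas inputs, so `y_test.index` should still hold the original positions even though the features are a sparse matrix. The sketch below illustrates that idea on a tiny made-up DataFrame (`toy_df` and its rows are assumptions for illustration, not the real CSV); the names `wrong_mask`, `wrong_idx`, and `errors` are likewise hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the review data (hypothetical rows, not from the real CSV).
toy_df = pd.DataFrame(
    {
        "review_text": [
            "great pizza and pasta",
            "tasty burger and fries",
            "delicious sushi rolls",
            "pizza was cold",
            "friendly dentist and clean clinic",
            "doctor was kind and helpful",
            "nurse was quick with the checkup",
            "clinic visit was short",
            "haircut was stylish",
            "manicure was lovely",
            "relaxing massage and facial",
            "salon did a lovely haircut",
        ],
        "business_category": ["food"] * 4 + ["health"] * 4 + ["beauty"] * 4,
    }
)

x = toy_df["review_text"]
y = toy_df["business_category"]

# Same preprocessing steps as in the question: bag of words, then TF-IDF.
bow_x = CountVectorizer().fit_transform(x)
tfidf_x = TfidfTransformer().fit_transform(bow_x)

x_train, x_test, y_train, y_test = train_test_split(
    tfidf_x, y, test_size=0.3, random_state=101
)

mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train, y_train)
predmnb = mnb.predict(x_test)

# y_test is still a Series carrying the original row labels, so a boolean
# mask over it maps the wrong predictions back to rows of the source frame.
wrong_mask = y_test.to_numpy() != predmnb
wrong_idx = y_test.index[wrong_mask]

# Look up the original reviews and attach what the model predicted for them.
errors = toy_df.loc[wrong_idx].assign(prediction=predmnb[wrong_mask])
print(errors)
```

With the real data, the same two mask lines should apply unchanged to `y_test` and `predmnb` from the question, with `cleaned_df` in place of `toy_df`.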

Best answer

I managed to get the result like this:

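# `c` is the fitted classifier; it is fed raw text here, so presumably a pipeline.
# The join below matches rows by index, so preprocessed_df needs a default 0..n-1 index.
predictions = c.predict(preprocessed_df["review_text"])
df2 = preprocessed_df.join(pd.DataFrame(predictions))
df2.columns = ["review_text", "business_category", "word_count", "prediction"]
# Keep only the rows where the true label and the prediction disagree.
df2[df2["business_category"] != df2["prediction"]]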


I'm sure there is a more elegant way...
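A possibly tidier variant, offered as a sketch rather than as the answerer's method: split the DataFrame itself so every test row keeps its original index, wrap the vectorizer and classifier in one scikit-learn pipeline, and attach the predictions with `assign`, which avoids the index-based join. `TfidfVectorizer` is used here as the documented equivalent of `CountVectorizer` followed by `TfidfTransformer`. The `reviews` frame, its rows, and its non-default index starting at 100 are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with a non-default index, to show that the
# original row labels survive the split and the filtering.
reviews = pd.DataFrame(
    {
        "review_text": [
            "great pizza and pasta",
            "tasty burger and fries",
            "delicious sushi rolls",
            "pizza was cold",
            "friendly dentist and clean clinic",
            "doctor was kind and helpful",
            "nurse was quick with the checkup",
            "clinic visit was short",
            "haircut was stylish",
            "manicure was lovely",
            "relaxing massage and facial",
            "salon did a lovely haircut",
        ],
        "business_category": ["food"] * 4 + ["health"] * 4 + ["beauty"] * 4,
    },
    index=range(100, 112),
)

# Splitting the frame keeps text, label, and index together in each part.
train_df, test_df = train_test_split(reviews, test_size=0.3, random_state=101)

# The pipeline takes raw strings; the vectorizer is fitted on training text only.
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.14))
model.fit(train_df["review_text"], train_df["business_category"])

# assign() adds the predictions positionally, so no index alignment is involved.
scored = test_df.assign(prediction=model.predict(test_df["review_text"]))
misclassified = scored[scored["business_category"] != scored["prediction"]]

print(misclassified.index.tolist())  # original row labels of the wrong predictions
print(misclassified)
```

Because the pipeline accepts raw text, the rows in `misclassified` could in principle be handed to a text explainer such as LIME together with `model.predict_proba`. Fitting the vectorizer on the training split only is a design choice of this sketch; the question's code fits it on the full dataset before splitting.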

On "python - How to get a list of the wrong predictions on the validation set", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56754153/
