python - sklearn MultinomialNB: Most Informative Features

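My classifier reaches about 99% accuracy on the test data, which makes me a bit suspicious, so I want to look at the most informative features of my NB classifier to see what kind of features it is learning. The following topic was very useful: How to get most informative features for scikit-learn classifiers?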

As for my feature input, I am still experimenting; right now I am testing a simple unigram model using CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 1), min_df=2, stop_words='english')

In the topic mentioned above I found the following function:

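def show_most_informative_features(vectorizer, clf, n=20):
    # Note: recent scikit-learn releases use get_feature_names_out() instead.
    feature_names = vectorizer.get_feature_names()
    # Pair each coefficient with its feature name and sort ascending.
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Lowest n pairs alongside the highest n pairs (highest first).
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))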

It gives the following result:

    -16.2420        114th                   -4.0020 said
    -16.2420        115                     -4.6937 obama
    -16.2420        136                     -4.8614 house
    -16.2420        14th                    -5.0194 president
    -16.2420        15th                    -5.1236 state
    -16.2420        1600                    -5.1370 senate
    -16.2420        16th                    -5.3868 new
    -16.2420        1920                    -5.4004 republicans
    -16.2420        1961                    -5.4262 republican
    -16.2420        1981                    -5.5637 democrats
    -16.2420        19th                    -5.6182 congress
    -16.2420        1st                     -5.7314 committee
    -16.2420        31st                    -5.7732 white
    -16.2420        3rd                     -5.8227 security
    -16.2420        4th                     -5.8256 states
    -16.2420        5s                      -5.8530 year
    -16.2420        61                      -5.9099 government
    -16.2420        900                     -5.9464 time
    -16.2420        911                     -5.9984 department
    -16.2420        97                      -6.0273 gop

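It works, but I would like to know what this function does so that I can interpret the results. Mostly, I am confused about what the 'coef_' attribute does.

I understand that the left side shows the 20 feature names with the lowest coefficients and the right side shows the features with the highest coefficients. But how exactly does this work, and how should I interpret this overview? Does it mean that the left side holds the most informative features for the negative class and the right side those for the positive class?

Also, on the left side it looks as if the feature names are sorted alphabetically. Is that correct?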

Best answer

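The coef_ attribute of MultinomialNB is a re-parameterization of the naive Bayes model as a linear classifier model. For a binary classification problem, it is basically the log of the estimated probability of a feature given the positive class. This means that higher values indicate features that are more important for the positive class.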
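As a rough illustration of that statement, the sketch below fits MultinomialNB on a tiny made-up corpus and checks that the positive-class row of feature_log_prob_ matches the smoothed log probabilities computed by hand. The documents, labels, and variable names are assumptions for illustration only. It reads feature_log_prob_ rather than coef_ because that attribute stores log P(feature | class) directly; in releases that still expose coef_ for MultinomialNB, the binary-case coef_[0] should, as far as I understand, be this same row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (hypothetical): 1 = politics (positive class), 0 = sports.
docs = [
    "senate passes budget bill",
    "president signs budget bill",
    "team wins final game",
    "coach praises team after game",
]
labels = np.array([1, 1, 0, 0])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray()

alpha = 1.0  # Laplace smoothing, the MultinomialNB default
clf = MultinomialNB(alpha=alpha).fit(X, labels)

# Row of log P(feature | class = 1) stored by the fitted model.
pos_idx = list(clf.classes_).index(1)
log_prob_pos = clf.feature_log_prob_[pos_idx]

# The same quantity by hand: smoothed count of each word in positive
# documents, divided by the smoothed total count for that class.
counts_pos = X[labels == 1].sum(axis=0)
n_features = X.shape[1]
manual = np.log((counts_pos + alpha) / (counts_pos.sum() + alpha * n_features))

print(np.allclose(log_prob_pos, manual))
```

Words that occur often in positive-class documents get values closer to 0, while words that never occur there all share the same minimum value.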

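The printout above shows the 20 lowest values (the less predictive features) in the first column and the 20 highest values (the most predictive features) in the second column.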
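On the question of alphabetical order in the left column: the function sorts (coefficient, name) tuples, and Python compares tuples element by element, so equal coefficients fall back to comparing the names as strings. The left-hand values are all -16.2420, most likely because those words never occur in the positive class and therefore share the same smoothed probability. A minimal sketch using a few values from the printout (the short lists here are just a hand-picked sample, not the real model output):

```python
# A few (coefficient, name) pairs taken from the printout, deliberately unordered.
coefs = [-16.2420, -4.0020, -16.2420, -16.2420]
names = ["15th", "said", "114th", "136"]

# sorted() compares the coefficient first; ties are broken by the name,
# using plain string comparison (so "136" sorts before "15th").
pairs = sorted(zip(coefs, names))

for coef, name in pairs:
    print("%.4f\t%s" % (coef, name))
```

So the ordering among tied features is a side effect of the sort and carries no information about the model.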

For more on python - sklearn MultinomialNB: Most Informative Features, see the corresponding question on Stack Overflow: https://stackoverflow.com/questions/29867367/