python - Stop words showing up among the most influential words

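I am running some NLP code to find the most influential words (positive or negative) in a survey. My problem is that, although I successfully added some extra stop words to the NLTK stop-word list, they still show up later as influential words.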

I have a dataframe whose first column contains the scores and whose second column contains the comments.

I add the extra stop words:

from nltk.corpus import stopwords

stopwords = stopwords.words('english')
extra = ['Cat', 'Dog']
stopwords.extend(extra)

I confirmed that they were added by checking len on the list before and after.

I created this function to remove punctuation and stop words from the comments:

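import string

def text_process(comment):
    # Drop punctuation characters, then rebuild the string
    nopunc = [char for char in comment if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # Keep only the words whose lowercase form is not a stop word
    return [word for word in nopunc.split() if word.lower() not in stopwords]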

I run the model (I am not including the whole code, since it makes no difference):

corpus = df['Comment']
y = df['Label']
vectorizer = CountVectorizer(analyzer=text_process)
x = vectorizer.fit_transform(corpus)

...

Then I get the most influential words:

feature_to_coef = {word: coef for word, coef in zip(vectorizer.get_feature_names(), nb.coef_[0])}


for best_positive in sorted(
    feature_to_coef.items(),
    key=lambda x: x[1],
    reverse=True)[:20]:
    print(best_positive)

However, Cat and Dog both appear in the results.

What am I doing wrong? Any ideas?

Thank you very much!

Best answer

It looks like this happens because you added 'Cat' and 'Dog' with a capital first letter.

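In your text_process function you have if word.lower() not in stopwords, which only works when the stop words themselves are lowercase. Each word is lowercased before the lookup, so 'cat' is compared against a list that contains only 'Cat', never matches, and is kept.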
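A minimal sketch of one possible fix: lowercase the extra words before adding them, so they match the lowercased lookup. To keep the sketch runnable without downloading the NLTK corpus, base_stopwords is a tiny hypothetical stand-in for stopwords.words('english'), and the names base_stopwords and stop_set are assumptions that do not appear in the question.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); the real list is
# much longer. Used here only so the sketch runs without NLTK data.
base_stopwords = ['the', 'is', 'and', 'a', 'my']
extra = ['Cat', 'Dog']

# Lowercase the extra words so they match the word.lower() lookup below.
# A set also makes the membership test faster than a list.
stop_set = set(base_stopwords) | {w.lower() for w in extra}

def text_process(comment):
    # Drop punctuation characters, then rebuild the string
    nopunc = ''.join(ch for ch in comment if ch not in string.punctuation)
    # Keep only the words whose lowercase form is not a stop word
    return [word for word in nopunc.split() if word.lower() not in stop_set]

tokens = text_process("My Cat is great, and the dog is loud!")
print(tokens)  # ['great', 'loud']
```

With the original capitalized entries, 'Cat' and 'dog' would both have survived the filter, since neither 'cat' nor 'dog' was in the list.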

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/56381970/