How to set custom stop words for sklearn CountVectorizer?
Problem description
I'm trying to run LDA (Latent Dirichlet Allocation) on a non-English text dataset.
sklearn's tutorial has this part, which counts the term frequencies of the words to feed into the LDA:
from sklearn.feature_extraction.text import CountVectorizer
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
This uses the built-in stop word list, which I think is only available for English. How can I use my own stop word list instead?
Recommended answer
You can pass a list of your own words to the stop_words argument (older versions also accept a frozenset), e.g.:
stop_words = ["word1", "word2", "word3"]
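As a fuller illustration, here is a minimal sketch of passing a custom stop word list to CountVectorizer. The German words and the three toy documents are made up for the example, and a list is used because that is the type the scikit-learn documentation names for stop_words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical German stop words and toy documents, for illustration only.
custom_stop_words = ["der", "die", "das", "und", "ist"]
docs = [
    "der Hund und die Katze",
    "die Katze ist das Haustier",
    "der Hund ist das Haustier",
]

# Pass the custom list where the tutorial passes 'english'.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                stop_words=custom_stop_words)

# Sparse document-term matrix of raw counts, ready to feed into LDA.
tf = tf_vectorizer.fit_transform(docs)

# Terms that survived stop word removal and the max_df/min_df filters.
vocabulary = sorted(tf_vectorizer.vocabulary_)
print(vocabulary)
```

The stop words should be written the way they look after the vectorizer's own preprocessing (lowercased by default), otherwise they will not match the tokens and will not be removed.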