I have 5 sentences in an np.array and I want to find the n most common words. For example, if n=5, I want the 5 most common words. Here is an example:
0 rt my mother be on school amp race
1 rt i am a red hair down and its a great
2 rt my for your every day and my chocolate
3 rt i am that red human being a man
4 rt my mother be on school and wear
Here is the code I use to get the n most common words.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
A = np.array(["rt my mother be on school amp race",
"rt i am a red hair down and its a great",
"rt my for your every day and my chocolate",
"rt i am that red human being a man",
"rt my mother be on school and wear"])
n = 5
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(A)
vocabulary = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
ind = np.argsort(X.toarray().sum(axis=0))[-n:]
top_n_words = [vocabulary[a] for a in ind]
print(top_n_words)
The result is:
['school', 'am', 'and', 'my', 'rt']
However, I want to exclude stop words such as 'and', 'am', and 'my' from these most common words. How can I do this?
Best answer
You only need to pass the parameter stop_words='english' to CountVectorizer():
vectorizer = CountVectorizer(stop_words='english')
You should now get:
['wear', 'mother', 'red', 'school', 'rt']
See the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
Regarding python - removing stop words from the most common words in a set of sentences in Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57073029/