I am trying to get the 10 most frequent words per category. I have seen this answer, but I can't quite adapt it to produce the output I want.
category | sentence
A        | cat runs over big dog
A        | dog runs over big cat
B        | random sentences include words
C        | including this one
Desired output:
category | word/frequency
A        | runs: 2
         | cat: 2
         | dog: 2
         | over: 2
         | big: 2
B        | random: 1
C        | including: 1
Since my DataFrame is large, I only want the top 10 most frequent words for each category. I have also seen this answer:
df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))
However, this approach returns counts of individual letters as well.
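A likely reason, assuming the column holds plain strings rather than lists of tokens: the inner loop `for w in wordlist` then iterates over a string, which yields single characters. The sketch below reproduces the effect with `collections.Counter` standing in for `nltk.FreqDist`; the sample sentences come from the table above, and `str.split` is an assumed stand-in for a real tokenizer.

```python
from collections import Counter

# Two sample sentences taken from category A in the question.
sentences = ["cat runs over big dog", "dog runs over big cat"]

# Iterating over a str yields single characters, so this counts letters
# (and spaces), which mirrors the behaviour described in the question.
char_counts = Counter(w for wordlist in sentences for w in wordlist)

# Splitting each sentence first yields whole words, so this counts words.
# str.split is a simple whitespace tokenizer used here as an assumption.
word_counts = Counter(w for sentence in sentences for w in sentence.split())

print(char_counts.most_common(3))
print(word_counts.most_common(3))
```

Tokenizing each sentence before counting is what separates the two results.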
Best answer
You can join the rows of each group into one string, tokenize it, and then apply FreqDist:
df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))
Output:
category
a         big          2.0
          cat          2.0
          dog          2.0
          over         2.0
          runs         2.0
c         include      1.0
          random       1.0
          sentences    1.0
          words        1.0
d         including    1.0
          one          1.0
          this         1.0
Name: sentence, dtype: float64
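The output above lists every word in each category. To keep only the 10 most frequent words per category, as the question asks, one option is to call `most_common(10)` on the per-group counter. The following is a minimal sketch rather than the answer's own code: it uses `collections.Counter` and whitespace splitting in place of `nltk.FreqDist` and `word_tokenize` so that it runs without NLTK data, and the helper name `top_words` is hypothetical.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "category": ["A", "A", "B", "C"],
    "sentence": [
        "cat runs over big dog",
        "dog runs over big cat",
        "random sentences include words",
        "including this one",
    ],
})


def top_words(sentences, n=10):
    """Return up to n (word, count) pairs for one group of sentences.

    Hypothetical helper: joins the sentences, splits on whitespace
    (an assumed stand-in for a real tokenizer), and counts the words.
    """
    counts = Counter(" ".join(sentences).split())
    return counts.most_common(n)


# Iterating over a grouped Series yields (group key, Series) pairs.
top10 = {
    category: top_words(group, n=10)
    for category, group in df.groupby("category")["sentence"]
}

for category, pairs in top10.items():
    for word, count in pairs:
        print(category, word, count)
```

Since `nltk.FreqDist` is a subclass of `Counter`, the same `most_common(10)` call should also work on the result of the answer's `FreqDist(...)` expression.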