I'm trying to get the 10 most frequent words for each category. I've already seen this answer, but I can't quite adapt it to produce the output I want.

category | sentence
  A           cat runs over big dog
  A           dog runs over big cat
  B           random sentences include words
  C           including this one

Desired output:
category | word/frequency
   A           runs: 2
               cat: 2
               dog: 2
               over: 2
               big: 2
   B           random: 1
   C           including: 1

Since my DataFrame is large, I only want the 10 most frequent words per category. I have also seen this answer:
df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))

But this approach also returns counts of individual letters.

Best answer

You can join the rows of each category into a single string, tokenize it, and apply FreqDist:

df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))

Output:
category
a         big          2.0
          cat          2.0
          dog          2.0
          over         2.0
          runs         2.0
c         include      1.0
          random       1.0
          sentences    1.0
          words        1.0
d         including    1.0
          one          1.0
          this         1.0
Name: sentence, dtype: float64
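
To keep only the 10 most frequent words per category, one option is to count with collections.Counter and call most_common(n). The following is a minimal sketch rather than a drop-in replacement for the FreqDist one-liner: it assumes plain whitespace tokenization with str.split instead of nltk's word_tokenize (so punctuation stays attached to words), and the names top_words_per_category and rows are hypothetical, introduced only for illustration.

```python
from collections import Counter, defaultdict


def top_words_per_category(rows, n=10):
    """Return {category: [(word, count), ...]} with at most n words each.

    `rows` is any iterable of (category, sentence) pairs (hypothetical input shape).
    """
    counts = defaultdict(Counter)
    for category, sentence in rows:
        # Split on whitespace so that whole words, not characters, are counted.
        counts[category].update(sentence.split())
    # most_common(n) keeps only the n highest counts in each category.
    return {cat: counter.most_common(n) for cat, counter in counts.items()}


# Sample data copied from the question.
rows = [
    ("A", "cat runs over big dog"),
    ("A", "dog runs over big cat"),
    ("B", "random sentences include words"),
    ("C", "including this one"),
]

for category, words in top_words_per_category(rows).items():
    print(category, words)
```

With a DataFrame, the pairs could be produced with zip(df['category'], df['sentence']). Splitting each sentence into words before counting is also what avoids the per-letter counts: iterating over a raw string yields its characters one at a time.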
