r - Count words in a data frame column

I have a data frame with sentences in its first column, and I want to count the words in them:

Input:

Foo bar
bar example
lalala foo
example sentence foo

Output:

foo       3
bar       2
example   2
lalala    1
sentence  1

Is there a simple way to do this?

If not, how should I go about it? I see two ways:

Append all the sentences in one huge string
And then count the words somehow

(seems very inefficient)
or

Split the column in multiple columns on spaces " " (I know there's a package for that, can't remember which one)
And then rbind all the columns into one

Best answer

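This is much like your second approach. We can split the column on spaces (" ") with strsplit, then use table to count the frequency of each word. The expected output also appears to be case-insensitive, so convert the column to lowercase before splitting.

Assume your data frame is named df and the target column is V1.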

table(unlist(strsplit(tolower(df$V1), " ")))

 #bar  example      foo   lalala sentence
 #  2        2        3        1        1

If the result needs to be a data frame,

data.frame(table(unlist(strsplit(tolower(df$V1), " "))))

#      Var1 Freq
#1      bar    2
#2  example    2
#3      foo    3
#4   lalala    1
#5 sentence    1
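
The output requested in the question is ordered by count, whereas table() orders its result alphabetically by word. As a minimal sketch, assuming the same input rebuilt here as df with a column V1 so the snippet runs on its own, the counts can be sorted in decreasing order. Splitting on the regular expression "\\s+" instead of a single space is an extra assumption, added so that repeated spaces do not produce empty "words":

```r
# Rebuild the question's input; the column name V1 follows the answer's assumption.
df <- data.frame(V1 = c("Foo bar", "bar example", "lalala foo", "example sentence foo"),
                 stringsAsFactors = FALSE)

# Lowercase, split on runs of whitespace, and flatten into a single character vector.
words <- unlist(strsplit(tolower(df$V1), "\\s+"))

# Count each word, then order the counts from most to least frequent.
word_counts <- sort(table(words), decreasing = TRUE)
word_counts

# The same counts as a data frame, one row per word.
as.data.frame(word_counts, stringsAsFactors = FALSE)
```

Here foo comes first with a count of 3; how words with equal counts are ordered relative to each other is left to sort().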

Edit

Following the OP's update in the comments: each sentence also has a score column, and we need to sum the scores for each word.

Adding a reproducible example,

df <- data.frame(v1 = c("Foo bar", "bar example", "lalala foo","example sentence foo"),
                 score = c(2, 3, 1, 4))
df

#                    v1 score
#1              Foo bar     2
#2          bar example     3
#3           lalala foo     1
#4 example sentence foo     4

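One way to solve this is with the packages splitstackshape and dplyr. We use cSplit to convert the sentences into a long data frame with one word per row, then summarise by word, computing the frequency (n()) and the sum of the scores.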

library(splitstackshape)
library(dplyr)
cSplit(df, "v1", sep = " ", direction = "long") %>%
      group_by(tolower(v1)) %>%
      summarise(Count = n(),
                ScoreSum = sum(score))

#  tolower(v1) Count ScoreSum
#        (chr) (int)    (dbl)
#1         foo     3        7
#2         bar     2        5
#3     example     2        7
#4      lalala     1        1
#5    sentence     1        4

Or using only the tidyverse

library(tidyverse)

df %>%
  separate_rows(v1, sep = ' ') %>%
  group_by(v1 = tolower(v1)) %>%
  summarise(Count = n(),
            ScoreSum = sum(score))
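
If installing packages is not an option, the same aggregation can be sketched in base R. This is not part of the original answer; the names word_list, long, counts, sums and result are made up for illustration. The idea is the same long format: one row per word, with each sentence's score repeated for every word it contains.

```r
# Reproducible input, matching the example above.
df <- data.frame(v1 = c("Foo bar", "bar example", "lalala foo", "example sentence foo"),
                 score = c(2, 3, 1, 4),
                 stringsAsFactors = FALSE)

# Split each lowercased sentence into its words (one list element per row of df).
word_list <- strsplit(tolower(df$v1), " ")

# Long format: one row per word, carrying the score of the sentence it came from.
long <- data.frame(word  = unlist(word_list),
                   score = rep(df$score, lengths(word_list)),
                   stringsAsFactors = FALSE)

# Frequency and score total per word; both are ordered alphabetically by word.
counts <- table(long$word)
sums   <- tapply(long$score, long$word, sum)

result <- data.frame(word     = names(counts),
                     Count    = as.vector(counts),
                     ScoreSum = as.vector(sums[names(counts)]),
                     stringsAsFactors = FALSE)
result
```

Indexing sums by names(counts) keeps the two columns aligned on the same words, rather than relying on both results happening to share an order.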

For r - Count words in a data frame column, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42742234/