r - How to vectorize R strsplit?

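When writing functions that use strsplit, vector inputs do not behave as desired, so sapply has to be used. This is because strsplit returns a list. Is there a way to vectorize the process, that is, to have the function produce the correct element of the list for each element of the input?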

For example, to count the length of each word in a character vector:

words <- c("a","quick","brown","fox")

> length(strsplit(words,""))
[1] 4 # The number of words (length of the list)

> length(strsplit(words,"")[[1]])
[1] 1 # The length of the first word only

> sapply(words,function (x) length(strsplit(x,"")[[1]]))
a quick brown   fox
1     5     5     3
# Success, but potentially very slow

Ideally, something like length(strsplit(words,"")[[.]]), where . is interpreted as the relevant part of the input vector.

Best answer

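In general, you should try to use a vectorized function to begin with. Using strsplit will usually require some kind of iteration afterwards (which will be slower), so avoid it if possible. In your example, you should use nchar instead: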

> nchar(words)
[1] 1 5 5 3
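If the names that sapply attaches to its result matter, they can be put back on top of nchar. A minimal sketch, reusing the words vector from the question; the variable names counts and named_counts are made up for illustration, and setNames comes from the stats package that ships with R:

```r
words <- c("a", "quick", "brown", "fox")

# nchar is vectorized: one call handles the whole character vector
counts <- nchar(words)

# Reattach the input strings as names, matching the named vector
# that sapply() returns for a character input
named_counts <- setNames(counts, words)
print(named_counts)
```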

More generally, take advantage of the fact that strsplit returns a list and use lapply:

> as.numeric(lapply(strsplit(words,""), length))
[1] 1 5 5 3
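Two base R alternatives to the lapply call may be worth knowing. This is only a sketch, assuming R 3.2.0 or later (where lengths was added to base); the names pieces, by_lengths and by_vapply are illustrative:

```r
words <- c("a", "quick", "brown", "fox")
pieces <- strsplit(words, "")  # a list with one character vector per word

# lengths() returns the length of each list element as an integer vector
# (assumes R >= 3.2.0)
by_lengths <- lengths(pieces)

# vapply() is a type-checked alternative to sapply(): every result
# must be a single integer, otherwise it raises an error
by_vapply <- vapply(pieces, length, integer(1))

print(by_lengths)
print(by_vapply)
```

Both still split every string first, so when only the character count is needed nchar remains the more direct choice.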

Otherwise, use one of the l*ply family of functions from plyr (after loading it with library(plyr)). For example:

> laply(strsplit(words,""), length)
[1] 1 5 5 3

Edit:

In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:

joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joyce <- unlist(strsplit(joyce, " "))

Now that I have all the words, we can do the counts:

> # original version
> system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
   2.65    0.03    2.73
> # vectorized function
> system.time(print(summary(nchar(joyce))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
   0.05    0.00    0.04
> # with lapply
> system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
    0.8     0.0     0.8
> # with laply (from plyr)
> system.time(print(summary(laply(strsplit(joyce,""), length))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.000   4.666   6.000  69.000
   user  system elapsed
  17.20    0.05   17.30
> # with ldply (from plyr)
> system.time(print(summary(ldply(strsplit(joyce,""), length))))
       V1
 Min.   : 0.000
 1st Qu.: 3.000
 Median : 4.000
 Mean   : 4.666
 3rd Qu.: 6.000
 Max.   :69.000
   user  system elapsed
   7.97    0.00    8.03

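The vectorized function (0.04 s elapsed) and lapply (0.8 s) are considerably faster than the original sapply version (2.73 s), while both plyr variants are slower than it. All solutions return the same answer (as seen in the summary output).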
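The comparison can be tried offline without downloading the text. The following is a rough sketch, not the original benchmark: fake_words is a made-up vector of random lowercase strings standing in for joyce, n_words is an arbitrary size, and the plyr variants are left out. Timings will vary by machine and R version.

```r
set.seed(1)

# Hypothetical stand-in for the Ulysses word vector: random lowercase "words"
n_words <- 20000
word_lens <- sample(1:10, n_words, replace = TRUE)
fake_words <- vapply(
  word_lens,
  function(k) paste(sample(letters, k, replace = TRUE), collapse = ""),
  character(1)
)

# Original approach: one strsplit call per word via sapply
t_sapply <- system.time(
  r_sapply <- sapply(fake_words,
                     function(x) length(strsplit(x, "")[[1]]),
                     USE.NAMES = FALSE)
)

# Vectorized function
t_nchar <- system.time(r_nchar <- nchar(fake_words))

# One strsplit call on the whole vector, then lapply over the list
t_lapply <- system.time(
  r_lapply <- as.numeric(lapply(strsplit(fake_words, ""), length))
)

# Collect elapsed seconds for a side-by-side look
elapsed <- c(sapply = t_sapply[["elapsed"]],
             nchar  = t_nchar[["elapsed"]],
             lapply = t_lapply[["elapsed"]])
print(elapsed)
```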

Apparently the latest version of plyr is faster (this was run with a slightly older version).

The original question, "r - How to vectorize R strsplit?", is on Stack Overflow: https://stackoverflow.com/questions/3054612/