regex - Splitting string columns in R, fast

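I have a data frame with 107 columns and 745,000 rows (much larger than my example).

Some of the character columns need to be split, because each value appears to carry a type suffix at the end of the sequence.

I want to move these type suffixes into new columns.

I have worked out my own solution, but it is far too slow when it has to loop over all 745,000 rows 53 times.

My solution is embedded in the following code, along with some arbitrary data: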

set.seed(1)
code_1 <- paste0(round(runif(5000, 100000, 999999)), "_", round(runif(1000, 1, 15)))
code_2 <- sample(c(paste0(round(runif(10, 100000, 999999)), "_", round(runif(10, 1, 15))), NA), 5000, replace = TRUE)
code_3 <- sample(c(paste0(round(runif(3, 100000, 999999)), "_", round(runif(3, 1, 15))), NA), 5000, replace = TRUE)
code_4 <- sample(c(paste0(round(runif(1, 100000, 999999)), "_", round(runif(1, 1, 15))), NA), 5000, replace = TRUE)

code_type_1 <- rep(NA, 5000)
code_type_2 <- rep(NA, 5000)
code_type_3 <- rep(NA, 5000)
code_type_4 <- rep(NA, 5000)

df <- data.frame(cbind(code_1,
                       code_2,
                       code_3,
                       code_4,
                       code_type_1,
                       code_type_2,
                       code_type_3,
                       code_type_4),
                 stringsAsFactors = FALSE)

df_new <- data.frame(code_1 = character(),
                     code_2 = character(),
                     code_3 = character(),
                     code_4 = character(),
                     code_type_1 = character(),
                     code_type_2 = character(),
                     code_type_3 = character(),
                     code_type_4 = character(),
                     stringsAsFactors = FALSE)

for (i in 1:4) {
  i_t <- i + 4
  temp <- strsplit(df[, c(i)], "[_]")
  for (j in 1:nrow(df)) {
    df_new[c(j), c(i)] <- unlist(temp[j])[1]
    df_new[c(j), c(i_t)] <- ifelse(is.na(unlist(temp[j])[1]), NA, unlist(temp[j])[2])
  }
  print(i)
}

for (i in 1:8) {
  df_new[, c(i)] <- factor(df_new[, c(i)])
}

Does anyone know how to speed this up?

Best answer

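First, we preallocate the result data.frame to its required final length. This is very important; see The R Inferno, Circle 2. Then we vectorize the inner loop. We also pass fixed = TRUE to strsplit, which avoids regular expression matching.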

system.time({
  df_new1 <- data.frame(code_1 = character(nrow(df)),
                       code_2 = character(nrow(df)),
                       code_3 = character(nrow(df)),
                       code_4 = character(nrow(df)),
                       code_type_1 = character(nrow(df)),
                       code_type_2 = character(nrow(df)),
                       code_type_3 = character(nrow(df)),
                       code_type_4 = character(nrow(df)),
                       stringsAsFactors = FALSE)

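  for (i in 1:4) {
    i_t <- i + 4
    # Split the whole column at once and bind the pieces into a two-column matrix
    temp <- do.call(rbind, strsplit(df[, c(i)], "_", fixed = TRUE))

    # Column 1 holds the code, column 2 the type suffix (NA where the input was NA)
    df_new1[, i] <- temp[,1]
    df_new1[, i_t] <- ifelse(is.na(temp[,1]), NA, temp[,2])
  }

  # Convert every column to a factor in one pass
  df_new1[] <- lapply(df_new1, factor)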
})
#   user      system     elapsed
#  0.029       0.000       0.029

all.equal(df_new, df_new1)
#[1] TRUE
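
The do.call(rbind, ...) step depends on how strsplit and rbind treat missing values, which is easy to overlook. The small sketch below uses a made-up three-element vector (not part of the data above) to show it: strsplit returns a single NA for a missing input, and rbind recycles that NA across both columns. It assumes every non-missing value contains exactly one underscore; values with more underscores would produce extra columns.

```r
# Hypothetical toy input: two codes with a type suffix and one missing value
parts <- strsplit(c("123456_7", NA, "654321_12"), "_", fixed = TRUE)

# Non-missing inputs yield two pieces, the NA input yields a single NA
lengths(parts)
# [1] 2 1 2

# rbind() recycles the length-1 NA to fill both columns of that row
mat <- do.call(rbind, parts)
mat
#      [,1]     [,2]
# [1,] "123456" "7"
# [2,] NA       NA
# [3,] "654321" "12"
```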

Of course, there are ways to make this even faster, but this stays close to your original approach and should be sufficient.
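
As one possible direction for a faster variant, the split can be done without building a list of pieces at all, by locating the first underscore with regexpr and cutting with substr. This is only a sketch: split_once is a hypothetical helper name, it has not been benchmarked against the code above, and it assumes only the first separator matters.

```r
# Hypothetical helper: split each string at its first separator, fully vectorized
split_once <- function(x, sep = "_") {
  # Position of the first separator: -1 if absent, NA if x is NA
  pos <- as.integer(regexpr(sep, x, fixed = TRUE))
  has_sep <- !is.na(pos) & pos > 0

  # Part before the separator; values without a separator are kept whole
  code <- x
  code[has_sep] <- substr(x[has_sep], 1, pos[has_sep] - 1)

  # Part after the separator; NA where there is nothing to extract
  type <- rep(NA_character_, length(x))
  type[has_sep] <- substr(x[has_sep], pos[has_sep] + nchar(sep), nchar(x[has_sep]))

  list(code = code, type = type)
}

res <- split_once(c("123456_7", NA, "abc"))
res$code
# [1] "123456" NA       "abc"
res$type
# [1] "7" NA  NA
```

Inside the loop, the two list elements could be assigned to df_new1[, i] and df_new1[, i_t] in place of the temp matrix columns.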