r - Optimization: splitting a data frame into a list of data frames and recoding the data row by row

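Preliminary: this question is mostly of educational value; the actual task is done, even if the approach is not entirely optimal. My question is whether the code below can be optimized for speed and/or implemented more elegantly, perhaps using additional packages such as plyr or reshape. On the real data it takes about 140 seconds to run, much longer than on the simulated data, because some of the original rows contain nothing but NA and additional checks have to be made. For comparison, the simulated data is processed in about 30 seconds.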

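Conditions: the dataset contains 360 variables, that is, 30 sets of 12 variables. Let's name them V1_1, V1_2, ... (first set), V2_1, V2_2, ... (second set), and so forth. Each set of 12 variables contains dichotomous (yes/no) responses that in practice correspond to career status, e.g. working (yes/no), studying (yes/no), and so on: 12 statuses in total, repeated 30 times.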

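Task: the task at hand is to recode each set of 12 dichotomous variables into a single variable with 12 response categories (e.g. working, studying, ...). In the end we should get 30 variables, each with 12 response categories.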
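
To make the target concrete, here is a minimal sketch of the recoding for a single set of 12 columns. The labels in `status_labels` are placeholders (the question names only "working" and "studying"), and `one_set` is a hand-made toy matrix, not the real data.

```r
# Hypothetical labels; the real status names are not given in the question
status_labels <- paste0("status", 1:12)

# Three toy respondents: a 1 in position 3, a 1 in position 1, and an all-NA row
one_set <- rbind(
  c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, NA, NA),
  c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, NA),
  rep(NA, 12)
)

# Position of the single 1 in each row (NA when the row has no 1)
pos <- apply(one_set, 1, function(x) match(1, x))

# One categorical variable with 12 response categories
recoded <- factor(pos, levels = 1:12, labels = status_labels)
```

Here `pos` is 3, 1, NA, so `recoded` holds "status3", "status1" and a missing value. Repeating this for all 30 sets gives the 30 variables described above.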

Data: I can't post the actual dataset, but here is a good simulated approximation:

randomRow <- function() {
  # make a row with a single 1 and some NA's
  sample(x=c(rep(0,9),1,NA,NA),size=12,replace=FALSE)
}

# create a matrix with 12 variables (columns) and 1500 cases (rows)
makeDf <- function() {
  data <- matrix(NA,ncol=12,nrow=1500)
  for (i in 1:1500) {
    data[i,] <- randomRow()
  }
  return(data)
}

mydata <- NULL

# combine 30 of these dataframes horizontally
for (i in 1:30) {
  mydata <- cbind(mydata,makeDf())
}
mydata <- as.data.frame(mydata) # example data ready

My solution:

# Divide the dataset into a list with 30 dataframes, each with 12 variables
S1 <- lapply(1:30,function(i) {
  Z <- rep(1:30,each=12) # define selection vector
  mydata[Z==i]           # use selection vector to get groups of variables (x12)
})

recodeDf <- function(df) {
  result <- as.numeric(apply(df,1,function(x) {
    if (any(!is.na(x))) which(x == 1) else NA # return the position of "1" per row
  }))                                         # the if/else check is for the real data
  return(result)
}
# Combine individual position vectors into a dataframe
final.df <- as.data.frame(do.call(cbind,lapply(S1,recodeDf)))

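To sum up, there is a double *apply call: one over the list and another over the rows of each data frame. This makes it slow. Any suggestions? Thanks in advance.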

Best answer

I really like @Arun's matrix multiplication idea. Interestingly, if you compile R against an OpenBLAS library, you could get it to run in parallel.
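
@Arun's answer is not reproduced here, so the following is only a sketch of how a matrix-multiplication recoding could look, not necessarily his exact code. It assumes every row of a set has at most one 1, uses a small simulated matrix (3 sets of 12 columns, 5 rows), and the names `one_set`, `m0` and `w` are made up for the illustration.

```r
set.seed(1)

# Same simulation idea as in the question, but smaller
one_set <- function(n) t(replicate(n, sample(c(rep(0, 9), 1, NA, NA))))
m <- cbind(one_set(5), one_set(5), one_set(5))   # 5 x 36 numeric matrix

m0 <- m
m0[is.na(m0)] <- 0   # NA would otherwise propagate through the product

# 36 x 3 weight matrix: column g holds 1:12 in the rows of set g, 0 elsewhere
w <- kronecker(diag(3), matrix(1:12, ncol = 1))

# Each entry is the sum of (indicator * position), i.e. the position of the 1
codes <- m0 %*% w    # 5 x 3 matrix
```

Rows without any 1 (such as all-NA rows) come out as 0 rather than NA in this sketch, so they would need a follow-up step like `codes[codes == 0] <- NA`.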

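However, I wanted to offer another solution, perhaps slower than the matrix multiplication, that keeps your original pattern but is much faster than your implementation: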

# Match is usually faster than which, because it only returns the first match
# (and therefore won't fail on multiple matches)
# It also neatly handles your *all NA* case
recodeDf2 <- function(df) apply(df,1,match,x=1)
# You can split your data.frame by column with split.default
# (Using split on data.frame will split-by-row)
S2<-split.default(mydata,rep(1:30,each=12))
final.df2<-lapply(S2,recodeDf2)
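
The comments above can be checked on a few hand-made vectors. This is a small illustration with made-up inputs (`x_one`, `x_none`, `x_two` and `df` are not part of the original data):

```r
x_one  <- c(0, NA, 1, 0)      # exactly one 1
x_none <- c(NA, NA, NA, NA)   # all-NA row, as in the real data
x_two  <- c(1, 0, 1, NA)      # two 1s, shown only for contrast

match(1, x_one)      # 3
which(x_one == 1)    # 3
match(1, x_none)     # NA
which(x_none == 1)   # integer(0), which breaks a one-value-per-row result
match(1, x_two)      # 1
which(x_two == 1)    # 1 3

# apply(df, 1, match, x = 1) passes each row as the second argument,
# i.e. it evaluates match(x = 1, table = <row>) for every row.

# split.default() treats the data frame as a list of columns
df <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)
by_col <- split.default(df, rep(1:2, each = 2))
names(by_col[[1]])   # "a" "b"
names(by_col[[2]])   # "c" "d"
```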

If you have a very large data frame and many processors, you may consider parallelizing this operation with:

library(parallel)
final.df2<-mclapply(S2,recodeDf2,mc.cores=numcores)
# Where numcores is your number of processors.
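
A self-contained sketch of the same call on toy data, in case it helps to try it out. The way `numcores` is chosen here is an assumption, not part of the original answer: `mclapply` relies on forking, so on Windows it only supports `mc.cores = 1`, and `detectCores()` may return NA on some systems. The names `make_block`, `toy`, `groups` and `recode` are invented for the example.

```r
library(parallel)

set.seed(42)
# Toy data: 3 sets of 4 columns, 6 rows, one 1 per row and set
make_block <- function(n) t(replicate(n, sample(c(0, 0, 1, NA))))
toy <- as.data.frame(cbind(make_block(6), make_block(6), make_block(6)))

groups <- split.default(toy, rep(1:3, each = 4))
recode <- function(df) apply(df, 1, match, x = 1)

# Assumed policy: leave one core free, fall back to 1 where forking is unavailable
numcores <- if (.Platform$OS.type == "windows") {
  1L
} else {
  max(1L, detectCores() - 1L, na.rm = TRUE)
}

par_res <- mclapply(groups, recode, mc.cores = numcores)
seq_res <- lapply(groups, recode)   # sequential reference result
```

For a job as small as this, the overhead of forking will likely outweigh any gain; the parallel version is only worth considering for much larger inputs.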

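After reading @Arun's and @mnel's answers, I learned a lot about how to improve this function, namely by avoiding the coercion to an array and by processing the data.frame by column instead of by row. I don't mean to "steal" an answer here; the OP should consider switching the accepted-answer checkmark to @mnel's answer.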

I did, however, want to share a solution that doesn't use data.table and avoids for. It is still slower than @mnel's solution, albeit only slightly.

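nograpes2<-function(mydata) {
  # Recode one set of 12 columns; assumes every row holds exactly one 1
  test<-function(df) {
    # For each column, the row numbers that contain a 1
    l<-lapply(df,function(x) which(x==1))
    # Number of hits per column, as a plain integer vector
    lens<-vapply(l,length,integer(1))
    # Column index of every hit, reordered so that hits are sorted by row
    rep.int(seq_along(l),times=lens)[order(unlist(l))]
  }
  S2<-split.default(mydata,rep(1:30,each=12))
  data.frame(lapply(S2,test))
}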
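
The `rep.int`/`order` step is compact, so here is the same idea traced on a tiny hand-made data frame with three columns and four rows. The intermediate names `col_of_hit` and `row_of_hit` are added for the illustration, and `lengths()` assumes R 3.2.0 or later.

```r
df <- data.frame(a = c(0, 1, 0, 0),
                 b = c(1, 0, 0, NA),
                 c = c(NA, 0, 1, 1))

l    <- lapply(df, function(x) which(x == 1))  # rows holding a 1, per column
lens <- lengths(l)                             # hits per column: 1 1 2

col_of_hit <- rep.int(seq_along(l), times = lens)  # 1 2 3 3
row_of_hit <- unlist(l)                            # 2 1 3 4

# Sort the column indices by the row they belong to
codes <- col_of_hit[order(row_of_hit)]             # 2 1 3 3
```

Row 1 has its 1 in column 2, row 2 in column 1, and rows 3 and 4 in column 3. Note that this only lines up with the rows when every row contains exactly one 1; an all-NA row would make the result shorter than the number of rows.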

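I'd also like to add that @Aaron's approach, using which with arr.ind=TRUE, would also be very fast and elegant if mydata had started out as a matrix rather than a data.frame. The coercion to a matrix is slower than the rest of the function. If speed is a concern, it would be worth considering reading the data in as a matrix in the first place.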
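
@Aaron's answer is not shown here, so the following is only a sketch of how `which(..., arr.ind=TRUE)` could be applied when the data is already a matrix. It uses a small simulated matrix (3 sets of 12 columns), and the helper names `one_set`, `hits`, `set_id` and `pos` are assumptions made for the example.

```r
set.seed(2)
one_set <- function(n) t(replicate(n, sample(c(rep(0, 9), 1, NA, NA))))
m <- cbind(one_set(5), one_set(5), one_set(5))   # already a matrix, 5 x 36
m[2, 1:12] <- NA                                 # an all-NA row in the first set

# One row per cell equal to 1, with columns "row" and "col"
hits <- which(m == 1, arr.ind = TRUE)

# Translate the overall column index into (set number, position within set)
set_id <- (hits[, "col"] - 1L) %/% 12L + 1L
pos    <- (hits[, "col"] - 1L) %% 12L + 1L

# Fill a rows x sets result; cells without a hit stay NA
codes <- matrix(NA_integer_, nrow = nrow(m), ncol = ncol(m) / 12)
codes[cbind(hits[, "row"], set_id)] <- pos
```

Because the result is pre-filled with NA, rows that contain no 1 in a given set need no special handling in this version.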