Applying a function by row in data.table using columns as arguments

Problem description

I am trying to apply a function by row in a data.table, using columns as arguments. I am currently using apply, as suggested elsewhere. However, my data.table has 27 million rows and 7 columns, so the apply operation takes a very long time, and when I run it over many input files the job uses all available RAM (32 GB). I am probably copying the data.table several times, though I'm not sure about that.

I would like help making this code more memory efficient, given that each input file will be ~30 million rows by 7 columns and there are 30 input files to process. I am fairly sure that the lines using apply are slowing down the whole code, so alternatives that are more memory efficient or use vectorized functions would probably be better options.

I've had a lot of trouble trying to write a vectorized function that takes 4 columns as arguments and operates row by row using data.table. The apply solution in my example code works, but it's very slow. One alternative I tried is:

    cols <- c("C", "T", "A", "G")
    func1 <- function(x) x[max1(x)]
    datU[, high1a := func1(cols), by = 1:nrow(datU)]

but the first 6 rows of the datU data.table output look like this (high1a holds a column name instead of a value):

    Cycle Tab    ID     colA     colB     colC     colG    high1 high1a
        1   0 45513 -233.781  -84.087   -3.141 3740.916 3740.916   colC
        2   0 45513 -103.561 -347.382 2900.866  357.071 2900.866   colC
        3   0 45513  153.383 4036.636  353.479  -42.736 4036.636   colC
        4   0 45513 -147.941   28.994 4354.994  384.945 4354.994   colC
        5   0 45513  -89.719 -504.643 1298.476   131.32 1298.476   colC
        6   0 45513  -250.11  -30.862 1877.049 -184.772 1877.049   colC

Here is my code using apply that works (it produced the high1 column above), but is too slow and memory intensive:

    library(data.table)

    # Get input files from the top directory, searching through all subdirectories.
    # Note: pattern is a regular expression, not a glob.
    file_list <- list.files(pattern = "\\.test\\.txt$", recursive = TRUE, full.names = TRUE)

    # Loop over the files found in the subdirectories, determine the highest and
    # second highest values in the specified columns, and create new columns with them
    savelist <- list()
    for (i in file_list) {
      datU <- fread(i)
      name <- dirname(i)
      # Compute the highest and second highest value for each row (cols 4,5,6,7)
      # and the difference between them
      maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
      max1 <- maxn(1)
      max2 <- maxn(2)
      colNum <- c(4, 5, 6, 7)
      datU[, high1 := apply(datU[, colNum, with = FALSE], 1, function(x) x[max1(x)])]
      datU[, high2 := apply(datU[, colNum, with = FALSE], 1, function(x) x[max2(x)])]
      datU[, difference := high1 - high2, by = 1:nrow(datU)]
      datU[, folder := name]
      savelist[[i]] <- datU
    }

    # Loop over the folders and output the data
    sigout <- NULL
    for (i in savelist) {
      # Do some stuff to manipulate data frames, then merge them for output
      setkey(i, Cycle, folder)
      Sums1 <- i[, sum(colA, colB, colC, colG), by = list(Cycle, folder)]
      MeanTot <- Sums1[, round(mean(V1), 3), by = list(Cycle, folder)]
      MeanTotsd <- Sums1[, round(sd(V1), 3), by = list(Cycle, folder)]
      Meandiff <- i[, list(meandiff = mean(difference)), by = list(Cycle, folder)]
      Meandiffsd <- i[, list(meandiffsd = sd(difference)), by = list(Cycle, folder)]
      df1out <- merge(MeanTot, MeanTotsd, by = c("Cycle", "folder"))
      df2out <- merge(Meandiff, Meandiffsd, by = c("Cycle", "folder"))
      sigout <- merge(df1out, df2out)
      # Output values
      write.table(sigout, "Sigout.txt", append = TRUE, quote = FALSE, sep = ",",
                  row.names = FALSE, col.names = TRUE)
    }

I would love some examples of alternatives to apply that give me the highest and second highest values in each row for columns 4, 5, 6 and 7, identified either by index or by column name. Thank you!

Solution

You could do something like this:

    library(data.table)

    DF <- read.table(text = "Cycle Tab ID colA colB colC colG high1 high1a
    1 0 45513 -233.781 -84.087 -3.141 3740.916 3740.916 colC
    2 0 45513 -103.561 -347.382 2900.866 357.071 2900.866 colC
    3 0 45513 153.383 4036.636 353.479 -42.736 4036.636 colC
    4 0 45513 -147.941 28.994 4354.994 384.945 4354.994 colC
    5 0 45513 -89.719 -504.643 1298.476 131.32 1298.476 colC
    6 0 45513 -250.11 -30.862 1877.049 -184.772 1877.049 colC", header = TRUE)
    setDT(DF)

    maxTwo <- function(x) {
      # The index is the same for all rows, so it could be made a
      # function parameter for better efficiency.
      ind <- length(x) - (1:0)
      # Partial sorting: returns the second-largest value, then the largest.
      as.list(sort.int(x, partial = ind)[ind])
    }

    DF[, paste0("max", 1:2) := maxTwo(unlist(.SD)), by = seq_len(nrow(DF)), .SDcols = 4:7]
    DF[, diffMax := max2 - max1]
    #    Cycle Tab    ID     colA     colB     colC     colG    high1 high1a    max1     max2  diffMax
    # 1:     1   0 45513 -233.781  -84.087   -3.141 3740.916 3740.916   colC  -3.141 3740.916 3744.057
    # 2:     2   0 45513 -103.561 -347.382 2900.866  357.071 2900.866   colC 357.071 2900.866 2543.795
    # 3:     3   0 45513  153.383 4036.636  353.479  -42.736 4036.636   colC 353.479 4036.636 3683.157
    # 4:     4   0 45513 -147.941   28.994 4354.994  384.945 4354.994   colC 384.945 4354.994 3970.049
    # 5:     5   0 45513  -89.719 -504.643 1298.476  131.320 1298.476   colC 131.320 1298.476 1167.156
    # 6:     6   0 45513 -250.110  -30.862 1877.049 -184.772 1877.049   colC -30.862 1877.049 1907.911

Note that max1 here is the second-highest value and max2 the highest. However, you would still be looping over the rows, which means nrow(DF) calls to the function. You could try Rcpp to do the looping in compiled code.
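For this particular task, the per-row function calls can also be avoided in plain R by looping over the four columns instead of the rows and keeping two running vectors with pmax and pmin. The following is only a minimal sketch and is not part of the original answer: top_two is a hypothetical helper name, the sample data is copied from the question, and the code assumes the columns are numeric and contain no NA values.

```r
library(data.table)

# Hypothetical helper (not from the original answer): returns the row-wise
# highest and second-highest values across the given columns.
# Assumes numeric columns without NA values.
top_two <- function(dt, cols) {
  first  <- rep(-Inf, nrow(dt))  # running row-wise maximum
  second <- rep(-Inf, nrow(dt))  # running row-wise second maximum
  for (col in cols) {            # loops over columns, not rows
    v <- dt[[col]]
    # The new runner-up is either the old runner-up or the smaller of
    # the old maximum and the incoming value.
    second <- pmax(second, pmin(first, v))
    first  <- pmax(first, v)
  }
  list(first, second)
}

# Sample rows copied from the question.
DT <- data.table(
  Cycle = 1:6, Tab = 0L, ID = 45513L,
  colA = c(-233.781, -103.561, 153.383, -147.941, -89.719, -250.110),
  colB = c(-84.087, -347.382, 4036.636, 28.994, -504.643, -30.862),
  colC = c(-3.141, 2900.866, 353.479, 4354.994, 1298.476, 1877.049),
  colG = c(3740.916, 357.071, -42.736, 384.945, 131.320, -184.772)
)

cols <- names(DT)[4:7]  # columns can be chosen by index or by name
DT[, c("high1", "high2") := top_two(.SD, cols), .SDcols = cols]
DT[, difference := high1 - high2]
```

Each pass touches one whole column at a time, so the number of R-level function calls depends on the number of columns rather than the number of rows, and the only extra memory is a few vectors of length nrow(DT).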
09-23 02:35