R在data.table中使用粘贴来子集可变列数并计算rowMeans

本文介绍了R在data.table中使用粘贴来子集可变列数并计算rowMeans的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述过去一周我开始使用 data.table ，我遇到了问题。我已查看此处和这里，但我不完全确定如何 c>> dput（dt）结构（list（link = c（1L，1L，1L，1L，1L，1L），id = c（8395,8738,9788,9789,9908,9920），person = c时间= c（NA，NA，25372,50700，NA，NA），max = c（11,14,9,1,7,10）（14,31,9,7,8,11），hr = c（NA，NA，7,14，NA，NA），minhr = c（11,19,7,14,7,16），maxhr = c（11,19,7,14,7,16），TRAVELTIME0.1avg = c（59,59,59,59,59,59），TRAVELTIME1.2avg = c（59,59,59,59,59， 59），TRAVELTIME2.3avg = C（59，59，59，59，59，59），TRAVELTIME3.4avg = C（59.2079086331819，59.2079086331819，59.2079086331819，59.2079086331819，59.2079086331819，59.2079086331819），TRAVELTIME4.5avg = C（59.9182362587214，59.9182362587214 ，59.9182362587214，59.9182362587214，59.9182362587214，59.9182362587214），TRAVELTIME5.6avg = C（60.4905040124798，60.4905040124798，60.4905040124798，60.4905040124798，60.4905040124798，60.4905040124798），TRAVELTIME6.7avg = C（59.2897529410742，59.2897529410742，59.2897529410742，59.2897529410742，59.2897529410742，59.2897529410742），TRAVELTIME7。 8avg = C（59.2717176535874，59.2717176535874，59.2717176535874，59.2717176535874，59.2717176535874，59.2717176535874），TRAVELTIME8.9avg = C（59.2569737174023，59.2569737174023，59.2569737174023，59.2569737174023，59.2569737174023，59.2569737174023），TRAVELTIME9.10avg = C（59.2814811928216，59.2814811928216，59.2814811928216，59.2814811928216， 59.2814811928216，59.2814811928216），TRAVELTIME10.11avg = C（59.2084537775537，59.2084537775537，59.2084537775537，59.2084537775537，59.2084537775537，59.2084537775537），TRAVELTIME11.12avg = C（59.0915653550983，59.0915653550983，59.0915653550983，59.0915653550983，59.0915653550983，59.0915653550983），TRAVELTIME12.13avg = C（59.6765035434587 ，59.6765035434587，59.6765035434587，59.6765035434587，59.6765035434587，59.6765035434587），TRAVELTIME13.14avg = C（59.246760177185，59.246760177185，59.246760177185，59.246760177185，59.246760177185，59.246760177185），TRAVELTIME14.15avg = C（59.4095339982924，59.4095339982924，59.4095339982924，59.4095339982924，59.4095339982924，59.4095339982924） TRAVELTIME15.16avg = C（59.5347570536373，59.5347570536373，59.5347570536373，59.5347570536373，59.5347570536373，59.5347570536373），TRAVELTIME16.17avg = C（59.3799872977671，59.3799872977671，59.3799872977671，59.3799872977671，59.3799872977671，59.3799872977671），TRAVELTIME17.18avg = C（59.1915498629857，59.1915498629857，59.1915498629857， 59.1915498629857，59.1915498629857，59.1915498629857），TRAVELTIME18.19avg = C（59.1663574471712，59.1663574471712，59.1663574471712，59.1663574471712，59.1663574471712，59.1663574471712），TRAVELTIME19.20avg = C（59.0217772215269，59.0217772215269，59.0217772215269，59.0217772215269，59.0217772215269，59.0217772215269），TRAVELTIME20.21avg = C （59.0893371757925，59.0893371757925，59.0893371757925，59.0893371757925，59.0893371757925，59.0893371757925），TRAVELTIME21.22avg = C（59.0272727272727，59.0272727272727，59.0272727272727，59.0272727272727，59.0272727272727，59.0272727272727），TRAVELTIME22.23avg = C（59，59，59，59，59，59 ），TRAVELTIME23.24avg = c（59,59,59,59,59,59），TRAVELTIME24.25avg = c（59,59,59,59,59,59），TRAVELTIME25.26avg = c（59,59， 59,59,59,59），TRAVELTIME26.27avg = c（59,59,59,59,59,59））.Names = c（link，id，person，seqid time，max，hr，minhr，maxhr，TRAVELTIME0.1avg，TRAVELTIME1.2avg，TRAVELTIME2.3avg，TRAVELTIME3.4avg，TRAVELTIME4.5avg TRAVELTIME5.6avg，TRAVELTIME6.7avg，TRAVELTIME7.8avg，TRAVELTIME8.9avg，TRAVELTIME9.10avg，TRAVELTIME10.11avg，TRAVELTIME11.12avg，TRAVELTIME12.13avg，TRAVELTIME13 .14avg，TRAVELTIME14.15avg，TRAVELTIME15.16avg，TRAVELTIME16.17avg，TRAVELTIME17.18avg，TRAVELTIME18.19avg，TRAVELTIME19.20avg，TRAVELTIME20.21avg，TRAVELTIME21.22avg ，TRAVELTIME22.23avg，TRAVELTIME23.24avg，TRAVELTIME24.25avg，TRAVELTIME25.26avg，TRAVELTIME26.27avg），sorted =link，class = c（data.table data.frame），row.names = c（NA，-6L）） Update1：To避免在创建 internal.selfref do dt 的问题和 maxhr 使用上面的示例。变量来计算旅行时间的子集，并计算这些子集旅行时间的 rowMeans ，并将其添加到当前的 dt 。如果 minhr （或 maxhr ）为11，则相应的旅行时间栏为 TRAVELTIME11.12avg ;如果为19，则相应的旅行时间列为 TRAVELTIME19.20avg 。因此，如果 minhr 为9，而 maxhr 为10，则我需要获取 TRAVELTIME9.10avg TRAVELTIME10.11avg ;类似地，如果 minhr 是15并且 maxhr 是17，则我需要获得 TRAVELTIME15.16avg ， TRAVELTIME16.17avg 和 TRAVELTIME17.18avg 。我尝试逐步处理问题，并使用以下代码跨越所有行的均匀行进时间列。它工作正常。 > dt [，avg：= rowMeans（.SD [，TRAVELTIME10.11avg：TRAVELTIME12.13avg，with = FALSE]），by =。（id，seqid）] 接下来，我试图通过引入 paste0（）来动态修改上面的代码。但是，这导致错误。此外，我尝试使用 as.symbol（paste0（））， noquote（paste0（））其他无法成功的技巧。 > dt [，avg：= rowMeans（.SD [，paste0（TRAVELTIME，minhr，。，minhr + 1，avg）：paste0（TRAVELTIME，maxhr，。，maxhr + 1， avg），with = FALSE]），by =。（id，seqid）] 在paste0（TRAVELTIME，minhr，。，minhr + 1，avg）中出错： paste0（TRAVELTIME，： NA / NaN参数此外：警告消息： 1：在eval（expr，envir，enclos）：强制引入的NAs 2 ：在eval（expr，envir，enclos）中：通过强制引入的NAs 问题： 1）如果使用粘贴命令，为什么 data.table 即使在取消索引粘贴的字符串）到子集列，而不是直接使用列名？它与每行的列数不相等有什么关系吗？ 2）因为我不成功，你能建议一种方法来找到平均过变量每行的列数，并将其添加回dt。如果建议导致一个有效的方式，我将不胜感激，因为，我已经尝试使用更简单的循环方法，它需要很长时间（我的整个数据集大约12到15小时）的数据大小。解决方案我相信这解决了你使用 paste0 tmp< - paste0（TRAVELTIME，dt $ minhr，。，dt $ minhr + 1，avg ） tmp1< - paste0（TRAVELTIME，dt $ maxhr，。，dt $ maxhr + 1，avg） dt1 有人可能会指出，你不需要在最后一行严格需要 $ ，但由于性质你所遇到的问题我觉得这对于识别和解决问题是有用的。 I have started using data.table over the past week and am facing an issue. I have already looked at the solution here and here but I am not entirely sure how it helps in my situation.Here, is my sample data.> dput(dt)structure(list(link = c(1L, 1L, 1L, 1L, 1L, 1L), id = c(8395, 8738, 9788, 9789, 9908, 9920), person = c(2937837, 3092435, 3511555, 3511555, 3568112, 3575082), seqid = c(11, 14, 9, 1, 7, 10), time = c(NA, NA, 25372, 50700, NA, NA), max = c(14, 31, 9, 7, 8, 11), hr = c(NA, NA, 7, 14, NA, NA), minhr = c(11, 19, 7, 14, 7, 16), maxhr = c(11, 19, 7, 14, 7, 16), TRAVELTIME0.1avg = c(59, 59, 59, 59, 59, 59 ), TRAVELTIME1.2avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME2.3avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME3.4avg = c(59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819, 59.2079086331819 ), TRAVELTIME4.5avg = c(59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214, 59.9182362587214), TRAVELTIME5.6avg = c(60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798, 60.4905040124798), TRAVELTIME6.7avg = c(59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742, 59.2897529410742 ), TRAVELTIME7.8avg = c(59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874, 59.2717176535874), TRAVELTIME8.9avg = c(59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023, 59.2569737174023), TRAVELTIME9.10avg = c(59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216, 59.2814811928216 ), TRAVELTIME10.11avg = c(59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537, 59.2084537775537 ), TRAVELTIME11.12avg = c(59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983, 59.0915653550983 ), TRAVELTIME12.13avg = c(59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587, 59.6765035434587 ), TRAVELTIME13.14avg = c(59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185, 59.246760177185), TRAVELTIME14.15avg = c(59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924, 59.4095339982924), TRAVELTIME15.16avg = c(59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373, 59.5347570536373 ), TRAVELTIME16.17avg = c(59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671, 59.3799872977671 ), TRAVELTIME17.18avg = c(59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857, 59.1915498629857 ), TRAVELTIME18.19avg = c(59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712, 59.1663574471712 ), TRAVELTIME19.20avg = c(59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269, 59.0217772215269 ), TRAVELTIME20.21avg = c(59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925, 59.0893371757925 ), TRAVELTIME21.22avg = c(59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727, 59.0272727272727 ), TRAVELTIME22.23avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME23.24avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME24.25avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME25.26avg = c(59, 59, 59, 59, 59, 59), TRAVELTIME26.27avg = c(59, 59, 59, 59, 59, 59)), .Names = c("link", "id", "person", "seqid", "time", "max", "hr", "minhr", "maxhr", "TRAVELTIME0.1avg", "TRAVELTIME1.2avg", "TRAVELTIME2.3avg", "TRAVELTIME3.4avg", "TRAVELTIME4.5avg", "TRAVELTIME5.6avg", "TRAVELTIME6.7avg", "TRAVELTIME7.8avg", "TRAVELTIME8.9avg", "TRAVELTIME9.10avg", "TRAVELTIME10.11avg", "TRAVELTIME11.12avg", "TRAVELTIME12.13avg", "TRAVELTIME13.14avg", "TRAVELTIME14.15avg", "TRAVELTIME15.16avg", "TRAVELTIME16.17avg", "TRAVELTIME17.18avg", "TRAVELTIME18.19avg", "TRAVELTIME19.20avg", "TRAVELTIME20.21avg", "TRAVELTIME21.22avg", "TRAVELTIME22.23avg", "TRAVELTIME23.24avg", "TRAVELTIME24.25avg", "TRAVELTIME25.26avg", "TRAVELTIME26.27avg"), sorted = "link", class = c("data.table", "data.frame"), row.names = c(NA, -6L))Update1: To avoid the issue of internal.selfref do dt <- data.table(dt) after you create dt using the above sample.I want to use the minhr and maxhr variables to subset the travel times and calculate the rowMeans for those subsetted travel times and add it to the current dt. If minhr (or maxhr) is 11, the corresponding travel time column is TRAVELTIME11.12avg; if it is 19, the corresponding travel time column is TRAVELTIME19.20avg. So, if minhr is 9 and maxhr is 10 for a row, then I need to get the mean of TRAVELTIME9.10avg and TRAVELTIME10.11avg; similarly, if minhr is 15 and maxhr is 17 then I need to get the mean of TRAVELTIME15.16avg, TRAVELTIME16.17avg, and TRAVELTIME17.18avg.I tried to approach the problem step-wise and used the following code for a simple case of uniform travel time columns across all the rows. It works fine.> dt[,avg:=rowMeans(.SD[,TRAVELTIME10.11avg:TRAVELTIME12.13avg, with=FALSE]),by=.(id, seqid)]Next, I tried to modify the above code by introducing paste0() to refer to the column names dynamically. But, this results in an error. Additionally, I have tried to use as.symbol(paste0()), noquote(paste0()) and a couple of other unquoting techniques without any success. > dt[,avg:=rowMeans(.SD[,paste0("TRAVELTIME", minhr, "." , minhr+1, "avg"):paste0("TRAVELTIME", maxhr, "." , maxhr+1, "avg"), with=FALSE]),by=.(id, seqid)]Error in paste0("TRAVELTIME", minhr, ".", minhr + 1, "avg"):paste0("TRAVELTIME", : NA/NaN argumentIn addition: Warning messages:1: In eval(expr, envir, enclos) : NAs introduced by coercion2: In eval(expr, envir, enclos) : NAs introduced by coercionGiven this, I have two questions:1) Why doesn't data.table recognize the column names if the paste command is used (even after unquoting the pasted strings) to subset columns as opposed to directly using the column names? Does it have anything to do with the unequal number of columns for every row?2) Since I am unsuccessful, can you please suggest a way to find the mean over variable number of columns for every row, and add it back to the dt. I would appreciate if the suggestion leads to an efficient way, because, I already tried this using a simpler looping method and it takes a long time (approximately 12 to 15 hours for my entire dataset) for the size of my data. 解决方案 I believe this solves the problem you were having with paste0:tmp <- paste0("TRAVELTIME", dt$minhr, "." , dt$minhr+1, "avg")tmp1 <- paste0("TRAVELTIME", dt$maxhr, "." , dt$maxhr+1, "avg")dt1 <- dt[,avg:=rowMeans(.SD[,get(tmp):get(tmp1), with=FALSE]),by=.(dt$id, dt$seqid)]Someone will probably point out that you don't strictly need the $ in the last line, but due to the nature of the problem you were having I felt this was useful for identifying and solving the problem. 这篇关于R在data.table中使用粘贴来子集可变列数并计算rowMeans的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！