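Data

Given two data.tables (dt and dt_lookup):

    library(data.table)

    set.seed(1234)
    t <- seq(1, 100); l <- letters

    ## NOTE: the definitions of la, lb and the id column are lost in this copy
    ## of the post. The values below are assumptions chosen so the example runs;
    ## n = 10000 matches the timing quoted further down.
    la <- letters[1:13]; lb <- letters[14:26]
    n <- 10000

    dt <- data.table(id = seq_len(n),
                     thisTime = sample(t, n, replace = TRUE),
                     thisLocation = sample(la, n, replace = TRUE),
                     finalLocation = sample(lb, n, replace = TRUE))

    set.seed(4321)
    ## the lkpId definition is also lost; a 1,000-element vector (an assumption)
    ## reproduces the recycling mentioned in the note below
    dt_lookup <- data.table(lkpId = paste0("l-", seq(1, 1000)),
                            lkpTime = sample(t, 10000, replace = TRUE),
                            lkpLocation = sample(l, 10000, replace = TRUE))
    ## NOTE: lkpId is recycled
    setkey(dt_lookup, lkpLocation)

I have a function that finds the lkpId that contains both thisLocation and finalLocation, and has the 'nearest' lkpTime (i.e. the smallest positive value of thisTime - lkpTime; the code uses a strict > 0).

    ## function to get the 'next' lkpId: the lkpId with both thisLocation and
    ## finalLocation, and the smallest positive difference between thisTime
    ## and dt_lookup$lkpTime
    ## (the right-hand sides of tempFinal, availServices and temp2 are missing
    ## from this copy and are reconstructed from the surrounding comments)
    getId <- function(thisTime, thisLocation, finalLocation){

      ## filter the lookup on thisLocation and finalLocation, and only keep
      ## the lkpIds that have both the 'this' and the 'final' location
      tempThis <- unique(dt_lookup[lkpLocation == thisLocation, lkpId])
      tempFinal <- unique(dt_lookup[lkpLocation == finalLocation, lkpId])
      availServices <- tempThis[tempThis %in% tempFinal]

      tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation == thisLocation,
                                 .(lkpId, lkpTime)]

      ## calculate the time difference between 'thisTime' and 'lkpTime' (from thisLocation)
      temp2 <- thisTime - tempThisFinal$lkpTime

      ## take the lkpId with the smallest positive difference
      selectedId <- tempThisFinal[min(which(temp2 == min(temp2[temp2 > 0]))), lkpId]
      selectedId
    }

Attempted solution

I need to get the lkpId for each row of dt. Therefore, my initial instinct was to use an *apply function, but it was taking too long (for me) when n/nrow > 1,000,000.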
So I've tried to implement a data.table solution to see if it's faster:

    selectedId <- dt[, .(lkpId = getId(thisTime, thisLocation, finalLocation)), by = id]

However, I'm fairly new to data.table, and this method doesn't appear to give any performance gains over an *apply solution:

    lkpIds <- apply(dt, 1, function(x){
      thisLocation <- as.character(x[["thisLocation"]])
      finalLocation <- as.character(x[["finalLocation"]])
      thisTime <- as.numeric(x[["thisTime"]])
      myId <- getId(thisTime, thisLocation, finalLocation)
    })

Both take ~30 seconds for n = 10,000.

Question

Is there a better way of using data.table to apply the getId function over each row of dt?

Update 12/08/2015

Thanks to the pointer from @eddi I've redesigned my whole algorithm and am making use of rolling joins (a good introduction), thus making proper use of data.table. I'll write up an answer later.

Solution

Having spent the time since asking this question looking into what data.table has to offer, and researching data.table joins thanks to @eddi's pointer (for example Rolling join on data.table, and inner join with inequality), I've come up with a solution.

One of the tricky parts was moving away from the thought of 'apply a function to each row', and redesigning the solution to use joins.

There will no doubt be better ways of programming this, but here's my attempt.

    ## want to find a lkpId for each id that has the minimum difference between
    ## 'thisTime' and 'lkpTime', and where the lkpId contains both
    ## 'thisLocation' and 'finalLocation'

    ## find all lookup ids where 'thisLocation' matches 'lkpLocation'
    ## and where thisTime - lkpTime > 0
    setkey(dt, thisLocation)
    setkey(dt_lookup, lkpLocation)

    dt_this <- dt[dt_lookup, {
      idx = thisTime - i.lkpTime > 0
      .(id = id[idx],
        lkpId = i.lkpId,
        thisTime = thisTime[idx],
        lkpTime = i.lkpTime)
    }, by = .EACHI]

    ## remove NAs
    dt_this <- dt_this[complete.cases(dt_this)]

    ## find all matching 'finalLocation' and 'lkpLocation'
    setkey(dt, finalLocation)
    ## inner join (and only return the id columns)
    dt_final <- dt[dt_lookup, nomatch = 0, allow.cartesian = TRUE][, .(id, lkpId)]

    ## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
    setkey(dt_this, id, lkpId)
    setkey(dt_final, id, lkpId)

    dt_join <- dt_this[dt_final, nomatch = 0]

    ## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
    dt_join[, timeDiff := thisTime - lkpTime]

    dt_join <- dt_join[dt_join[order(timeDiff), .I[1], by = id]$V1]

    ## equivalent dplyr code (operates on dt_join, which holds timeDiff)
    # library(dplyr)
    # dt_join <- dt_join %>%
    #   group_by(id) %>%
    #   arrange(timeDiff) %>%
    #   slice(1) %>%
    #   ungroup
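
The update mentions rolling joins, while the code above gets its result from equi-joins followed by filtering. For readers who have not met the idea, here is a minimal sketch of a rolling join on toy data. It is not the author's solution: the tables `lookup` and `query` and their columns are hypothetical, and the sketch ignores the finalLocation constraint. It only shows how `roll = TRUE` picks, per location, the latest lookup time at or before each query time.

```r
library(data.table)

## hypothetical toy tables (not the dt / dt_lookup from the question)
lookup <- data.table(loc   = c("a", "a", "a", "b"),
                     time  = c(10, 20, 30, 15),
                     lkpId = c("l-1", "l-2", "l-3", "l-4"))
query  <- data.table(id   = 1:3,
                     loc  = c("a", "a", "b"),
                     time = c(25, 5, 40))

## keep a copy of the lookup time: in the joined result the join column
## 'time' carries the query's values, not the lookup's
lookup[, lkpTime := time]

## exact match on loc, rolling match on time (the last join column):
## each query row gets the lookup row with the latest time <= its own time
res <- lookup[query, on = c("loc", "time"), roll = TRUE]
res
```

In this toy data, id 2 gets NA because no lookup time at location "a" comes before 5. Note that a rolling join also matches equal times, whereas the code above requires thisTime - lkpTime > 0, so the two are not drop-in replacements for each other.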
