Problem description
I have written some code to handle the following task:
ref = read.table(header=TRUE, text="
user event
1441 120120102
1441 120120888
1443 120122122
1445 120124452
1445 120123525
1446 120123463", stringsAsFactors=FALSE)
data = read.table(header=TRUE, text="
user event1 event2
1440 120123432 120156756
1441 120128523 120156545
1441 120123333 120146444
1441 120122344 120122355", stringsAsFactors=FALSE)
What I have here is a function (credit to user Carlos Cinelli) that lets me go line by line through the table data and record, for each row, how many events in ref fall between event1 and event2 for the same user id.
Now, I am wondering if there is a faster way to do what the function below does:
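library(data.table); setDT(ref); setDT(data)  # the data.table syntax below requires data.tables
count <- function(x, y, z) ref[, sum(event >= x & event <= y & user == z)]
data[, count := mapply(x = event1, y = event2, z = user, count)]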
I haven't been able to get very far and was wondering whether the data.table package has anything that could make the above faster. Thank you so much!
Recommended answer
Have a look at the examples in ?foverlaps. They clearly show how you can join on overlapping intervals within other identifiers.
require(data.table) ## 1.9.3+
setDT(ref)
setDT(data)
setkey(ref[, event2 := event])
ans = foverlaps(data, ref, by.x=c("user", "event1", "event2"), which=TRUE, nomatch=0L)
Your example is particularly bad because there are no overlaps, so I can't really demonstrate the next few steps. But ans should give you, for each row in data (xid), the row indices of the overlapping rows in ref (yid). The overlaps are obtained within user, since it was set as a key column as well.
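As a rough sketch of what those next steps could look like, the example below uses made-up tables (ref2 and data2 are hypothetical, not the data from the question) in which some events do fall inside the intervals. It counts the matches per row of data2 by grouping the foverlaps result on xid, and assumes that rows without any overlap should get a count of 0.

library(data.table)

# Hypothetical data: same layout as the question, values changed so overlaps exist
ref2 <- data.table(user  = c(1441L, 1441L, 1443L, 1445L),
                   event = c(120120102L, 120120888L, 120122122L, 120124452L))
data2 <- data.table(user   = c(1440L, 1441L, 1441L, 1443L),
                    event1 = c(120123432L, 120120000L, 120120500L, 120122122L),
                    event2 = c(120156756L, 120121000L, 120120600L, 120122200L))

# Treat each ref event as a zero-length interval [event, event2], keyed within user
ref2[, event2 := event]
setkey(ref2, user, event, event2)

# One row per (data2 row, ref2 row) overlap; data2 rows without a match are dropped
ans <- foverlaps(data2, ref2, by.x = c("user", "event1", "event2"),
                 which = TRUE, nomatch = 0L)

# Number of overlapping ref2 rows for each data2 row that had at least one match
counts <- ans[, .N, by = xid]

# Start every row at 0, then overwrite the rows that appear in counts
data2[, count := 0L]
data2[counts$xid, count := counts$N]

With this made-up data, count should come out as 0, 2, 0, 1, which is what the mapply version from the question gives on the same tables. The interval ends are inclusive, so an event equal to event1 or event2 is counted.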
I hope you can take it from here. If you find this doesn't solve your problem, please post an example that I can run to reproduce the issue you're running into.
HTH