问题描述
我有一个 data.table
(约3000万行),由<$ c $中的 datetime
列组成$ c> POSIXct 格式, id
列和其他一些列(在示例中,我只留下了一个不相关的列 x
来表明存在其他需要保留的列)。 dput
位于帖子的底部。
I have a data.table
(~30 million rows) consisting of a datetime
column in POSIXct
format, an id
column and a few other columns (in the example, I just left one irrelevant column x
to demonstrate that there are other columns present that need to be kept). A dput
is at the bottom of the post.
head(DT)
# datetime x id
#1: 2016-04-28 16:20:18 0.02461368 1
#2: 2016-04-28 16:41:34 0.88953932 1
#3: 2016-04-28 16:46:07 0.31818101 1
#4: 2016-04-28 17:00:56 0.14711365 1
#5: 2016-04-28 17:09:11 0.54406602 1
#6: 2016-04-28 17:39:09 0.69280341 1
id ,我只需要将那些相差超过30分钟的观察结果进行子集化。
Q: For each id
, I need to subset only those observations that differ by more than 30 minutes time. What could be an efficient data.table
approach to do this (if possible, without extensive looping)?
什么是有效的 data.table
方法来做到这一点(如果可能,而没有广泛的循环)?逻辑也可以描述为(例如下面的评论):
The logic can also be described as (like in my comment below):
在下面的输出中,我添加了一个列 keep
来指示哪些行应该保留在此示例中,因为它们与每个id保留的先前观察结果相差30分钟以上。困难之处在于,似乎有必要迭代计算时间差(或者至少目前我无法想到一种更有效的方法)。
In the dput below, I added a colum keep
to indicate which rows should be kept in this example because they differ by more than 30 minutes from the previous observation that is kept per id. The difficulty is that it seems to be necessary to calculate the time differences iteratively (or at least, I cannot think of a more efficient approach at the moment).
library(data.table)
DT <- structure(list(
datetime = structure(c(1461853218.81561, 1461854494.81561,
1461854767.81561, 1461855656.81561, 1461856151.81561, 1461857949.81561,
1461858601.81561, 1461858706.81561, 1461859078.81561, 1461859103.81561,
1461852799.81561, 1461852824.81561, 1461854204.81561, 1461855331.81561,
1461855633.81561, 1461856311.81561, 1461856454.81561, 1461857177.81561,
1461858662.81561, 1461858996.81561), class = c("POSIXct", "POSIXt")),
x = c(0.0246136845089495, 0.889539316063747, 0.318181007634848,
0.147113647311926, 0.544066024711356, 0.6928034061566, 0.994269776623696,
0.477795971091837, 0.231625785352662, 0.963024232536554, 0.216407935833558,
0.708530468167737, 0.758459537522867, 0.640506813768297, 0.902299045119435,
0.28915973729454, 0.795467417687178, 0.690705278422683, 0.59414202044718,
0.655705799115822),
id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
keep = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)),
.Names = c("datetime", "x", "id", "keep"),
row.names = c(NA, -20L),
class = c("data.table", "data.frame"))
setkey(DT, id, datetime)
DT[, difftime := difftime(datetime, shift(datetime, 1L, NA,type="lag"), units = "mins"),
by = id]
DT[is.na(difftime), difftime := 0]
DT[, difftime := cumsum(as.numeric(difftime)), by = id]
保持
列的说明:
- 行2:3与第1行的相差少于30分钟->删除
- 第4行的相差更多第1行少于30分钟->保持
- 第5行距离第4行不足30分钟->删除
- 第6行差异更大从第4行开始超过30分钟->保持
- ...
- Rows 2:3 differ by less than 30 minutes from row 1 -> delete
- Row 4 differs by more than 30 minutes from row 1 -> keep
- Row 5 dufferes by less than 30 minutes from row 4 -> delete
- Row 6 differs by more than 30 minutes from row 4 -> keep
- ...
所需的输出:
desiredDT <- DT[(keep)]
感谢我收到的三个专家解答。我测试了1到1000万行数据。这是基准的摘录。
Thanks for three expert answers I received. I tested them on 1 and 10 million rows of data. Here's an excerpt of the benchmarks.
a)一百万行
microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
times = 3L, unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval
# frank(DT_Frank) 1.286647 1.277104 1.185216 1.267769 1.140614 1.036749 3
# roland(DT_Roland) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
# eddi1(DT_Eddi1) 11.748622 11.697409 10.941792 11.647320 10.587002 9.720901 3
# eddi2(DT_Eddi2) 9.966078 9.915651 9.210168 9.866330 8.877769 8.070281 3
b)1000万行
microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
times = 3L, unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval
# frank(DT_Frank) 1.019561 1.025427 1.026681 1.031061 1.030028 1.029037 3
# roland(DT_Roland) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
# eddi1(DT_Eddi1) 11.567302 11.443146 11.301487 11.323914 11.176515 11.035143 3
# eddi2(DT_Eddi2) 9.796800 9.693823 9.526193 9.594931 9.398969 9.211019 3
显然, @frank的data.table方法和@Roland的基于Rcpp的解决方案在性能上相似,其中Rcpp略有优势,而@eddi的方法仍然快速,但性能不如其他方法。
Apparently, @Frank's data.table approach and @Roland's Rcpp based solution are similar in performance with Rcpp having a slight advantage, while @eddi's approaches were still fast but not as performant as the others.
但是,当我检查解决方案是否相等时,我发现@Roland的方法与其他方法相比有稍微不同的结果:
However, when I checked for equality of the solutions, I found that @Roland's approach has a slightly different result than the others:
a)一百万行
all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component "datetime": Numeric: lengths (982228, 982224) differ"
#[2] "Component "id": Numeric: lengths (982228, 982224) differ"
#[3] "Component "x": Numeric: lengths (982228, 982224) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE
b)一千万行
all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component "datetime": Numeric: lengths (9981898, 9981891) differ"
#[2] "Component "id": Numeric: lengths (9981898, 9981891) differ"
#[3] "Component "x": Numeric: lengths (9981898, 9981891) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE
我目前的假设是这种差异可能与差异是> 30分钟还是> = 30分钟有关,尽管我不是
My current assumption is that this difference might be related to whether the differnce is > 30 minutes or >= 30 minutes though I'm not sure about that yet.
最后的想法:我决定和@Frank一起去。 s解决方案有两个原因:1.它的性能非常好,几乎与Rcpp解决方案相当;并且2.它不需要另一个我不太熟悉的包(无论如何我都在使用data.table)
Final thought: I decided to go with @Frank's solution for two reasons: 1. it performs very well, almost equal to the Rcpp solution, and 2. it doesn't require another package with which I'm not very familiar yet (I'm using data.table anyway)
推荐答案
这就是我要做的事情:
setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data
s = 0L
w = DT[, .I[1L], by=id]$V1
while (length(w)){
s = s + 1L
DT[w, tag := s]
m = DT[w, .(id, datetime = datetime+30*60)]
w = DT[m, which = TRUE, roll=-Inf]
w = w[!is.na(w)]
}
给出
datetime x id keep tag
1: 2016-04-28 10:20:18 0.02461368 1 TRUE 1
2: 2016-04-28 10:41:34 0.88953932 1 FALSE NA
3: 2016-04-28 10:46:07 0.31818101 1 FALSE NA
4: 2016-04-28 11:00:56 0.14711365 1 TRUE 2
5: 2016-04-28 11:09:11 0.54406602 1 FALSE NA
6: 2016-04-28 11:39:09 0.69280341 1 TRUE 3
7: 2016-04-28 11:50:01 0.99426978 1 FALSE NA
8: 2016-04-28 11:51:46 0.47779597 1 FALSE NA
9: 2016-04-28 11:57:58 0.23162579 1 FALSE NA
10: 2016-04-28 11:58:23 0.96302423 1 FALSE NA
11: 2016-04-28 10:13:19 0.21640794 2 TRUE 1
12: 2016-04-28 10:13:44 0.70853047 2 FALSE NA
13: 2016-04-28 10:36:44 0.75845954 2 FALSE NA
14: 2016-04-28 10:55:31 0.64050681 2 TRUE 2
15: 2016-04-28 11:00:33 0.90229905 2 FALSE NA
16: 2016-04-28 11:11:51 0.28915974 2 FALSE NA
17: 2016-04-28 11:14:14 0.79546742 2 FALSE NA
18: 2016-04-28 11:26:17 0.69070528 2 TRUE 3
19: 2016-04-28 11:51:02 0.59414202 2 FALSE NA
20: 2016-04-28 11:56:36 0.65570580 2 TRUE 4
其背后的想法由OP :
The idea behind it is described by the OP in a comment:
这篇关于子集观察结果相差至少30分钟的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!