子集观察结果相差至少30分钟

本文介绍了子集观察结果相差至少30分钟的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个 data.table （约3000万行），由<$ c $中的 datetime 列组成$ c> POSIXct 格式， id 列和其他一些列（在示例中，我只留下了一个不相关的列 x 来表明存在其他需要保留的列）。 dput 位于帖子的底部。

I have a data.table (~30 million rows) consisting of a datetime column in POSIXct format, an id column and a few other columns (in the example, I just left one irrelevant column x to demonstrate that there are other columns present that need to be kept). A dput is at the bottom of the post.

head(DT)
#              datetime          x id
#1: 2016-04-28 16:20:18 0.02461368  1
#2: 2016-04-28 16:41:34 0.88953932  1
#3: 2016-04-28 16:46:07 0.31818101  1
#4: 2016-04-28 17:00:56 0.14711365  1
#5: 2016-04-28 17:09:11 0.54406602  1
#6: 2016-04-28 17:39:09 0.69280341  1

id ，我只需要将那些相差超过30分钟的观察结果进行子集化。

Q: For each id, I need to subset only those observations that differ by more than 30 minutes time. What could be an efficient data.table approach to do this (if possible, without extensive looping)?

什么是有效的 data.table 方法来做到这一点（如果可能，而没有广泛的循环）？逻辑也可以描述为（例如下面的评论）：

The logic can also be described as (like in my comment below):

在下面的输出中，我添加了一个列 keep 来指示哪些行应该保留在此示例中，因为它们与每个id保留的先前观察结果相差30分钟以上。困难之处在于，似乎有必要迭代计算时间差（或者至少目前我无法想到一种更有效的方法）。

In the dput below, I added a colum keep to indicate which rows should be kept in this example because they differ by more than 30 minutes from the previous observation that is kept per id. The difficulty is that it seems to be necessary to calculate the time differences iteratively (or at least, I cannot think of a more efficient approach at the moment).

library(data.table)
DT <- structure(list(
  datetime = structure(c(1461853218.81561, 1461854494.81561,
    1461854767.81561, 1461855656.81561, 1461856151.81561, 1461857949.81561,
    1461858601.81561, 1461858706.81561, 1461859078.81561, 1461859103.81561,
    1461852799.81561, 1461852824.81561, 1461854204.81561, 1461855331.81561,
    1461855633.81561, 1461856311.81561, 1461856454.81561, 1461857177.81561,
    1461858662.81561, 1461858996.81561), class = c("POSIXct", "POSIXt")),
  x = c(0.0246136845089495, 0.889539316063747, 0.318181007634848,
  0.147113647311926, 0.544066024711356, 0.6928034061566, 0.994269776623696,
  0.477795971091837, 0.231625785352662, 0.963024232536554, 0.216407935833558,
  0.708530468167737, 0.758459537522867, 0.640506813768297, 0.902299045119435,
  0.28915973729454, 0.795467417687178, 0.690705278422683, 0.59414202044718,
  0.655705799115822),
  id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  keep = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
           FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)),
  .Names = c("datetime", "x", "id", "keep"),
  row.names = c(NA, -20L),
  class = c("data.table", "data.frame"))

setkey(DT, id, datetime)
DT[, difftime := difftime(datetime, shift(datetime, 1L, NA,type="lag"), units = "mins"),
   by = id]
DT[is.na(difftime), difftime := 0]
DT[, difftime := cumsum(as.numeric(difftime)), by = id]

保持列的说明：

行2：3与第1行的相差少于30分钟->删除

第4行的相差更多第1行少于30分钟->保持

第5行距离第4行不足30分钟->删除

第6行差异更大从第4行开始超过30分钟->保持

Rows 2:3 differ by less than 30 minutes from row 1 -> delete
Row 4 differs by more than 30 minutes from row 1 -> keep
Row 5 dufferes by less than 30 minutes from row 4 -> delete
Row 6 differs by more than 30 minutes from row 4 -> keep
...

所需的输出：

desiredDT <- DT[(keep)]

感谢我收到的三个专家解答。我测试了1到1000万行数据。这是基准的摘录。

Thanks for three expert answers I received. I tested them on 1 and 10 million rows of data. Here's an excerpt of the benchmarks.

a）一百万行

microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
               times = 3L, unit = "relative")
#Unit: relative
#              expr       min        lq      mean    median        uq      max neval
#   frank(DT_Frank)  1.286647  1.277104  1.185216  1.267769  1.140614 1.036749     3
# roland(DT_Roland)  1.000000  1.000000  1.000000  1.000000  1.000000 1.000000     3
#   eddi1(DT_Eddi1) 11.748622 11.697409 10.941792 11.647320 10.587002 9.720901     3
#   eddi2(DT_Eddi2)  9.966078  9.915651  9.210168  9.866330  8.877769 8.070281     3

b）1000万行

microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
                times = 3L, unit = "relative")
#Unit: relative
#              expr       min        lq      mean    median        uq       max neval
#   frank(DT_Frank)  1.019561  1.025427  1.026681  1.031061  1.030028  1.029037     3
# roland(DT_Roland)  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000     3
#   eddi1(DT_Eddi1) 11.567302 11.443146 11.301487 11.323914 11.176515 11.035143     3
#   eddi2(DT_Eddi2)  9.796800  9.693823  9.526193  9.594931  9.398969  9.211019     3

显然， @frank的data.table方法和@Roland的基于Rcpp的解决方案在性能上相似，其中Rcpp略有优势，而@eddi的方法仍然快速，但性能不如其他方法。

Apparently, @Frank's data.table approach and @Roland's Rcpp based solution are similar in performance with Rcpp having a slight advantage, while @eddi's approaches were still fast but not as performant as the others.

但是，当我检查解决方案是否相等时，我发现@Roland的方法与其他方法相比有稍微不同的结果：

However, when I checked for equality of the solutions, I found that @Roland's approach has a slightly different result than the others:

a）一百万行

all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component "datetime": Numeric: lengths (982228, 982224) differ"
#[2] "Component "id": Numeric: lengths (982228, 982224) differ"
#[3] "Component "x": Numeric: lengths (982228, 982224) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE

b）一千万行

all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component "datetime": Numeric: lengths (9981898, 9981891) differ"
#[2] "Component "id": Numeric: lengths (9981898, 9981891) differ"
#[3] "Component "x": Numeric: lengths (9981898, 9981891) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE

我目前的假设是这种差异可能与差异是> 30分钟还是> = 30分钟有关，尽管我不是

My current assumption is that this difference might be related to whether the differnce is > 30 minutes or >= 30 minutes though I'm not sure about that yet.

最后的想法：我决定和@Frank一起去。 s解决方案有两个原因：1.它的性能非常好，几乎与Rcpp解决方案相当；并且2.它不需要另一个我不太熟悉的包（无论如何我都在使用data.table）

Final thought: I decided to go with @Frank's solution for two reasons: 1. it performs very well, almost equal to the Rcpp solution, and 2. it doesn't require another package with which I'm not very familiar yet (I'm using data.table anyway)

kept

子集观察结果相差至少30分钟

问题描述

推荐答案