本文介绍了重叠连接以计算间隔内的平均值?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我经常需要在给定时间间隔(事件)内平均时间序列数据,基本上是。



如答案所示,我使用一个SQL语句对我的数据'长'格式。这里是一个例子:

  c> 事件  foverlaps 并删除不匹配的时间间隔( nomatch = 0L )。 
  • code> 由 id 和


    I frequently have to average time series data within given time intervals ('events'), basically as has been asked here.

    As suggested in the answers, I use an SQL statement on my data in 'long' format. Here is an example:

    #create dummy data frame
    set.seed(1)
    data <- data.frame(
      date = seq(from = as.POSIXct("2014-01-01 00:00"),
                 to = as.POSIXct("2014-01-31 23:00"),
                 by = 300),
      A = runif(8917),
      B = runif(8917),
      C = runif(8917),
      D = runif(8917)
    )
    
    #convert to long format
    require(dplyr)
    data <- data %>%
      gather(class,value,A:D)
    
    # create dummy events
    events <- data.frame(
      id = c("blue","red","green","yellow"),
      start = as.POSIXct(c("2014-01-03 13:00",
                           "2014-01-12 08:00",
                           "2014-01-18 10:00",
                           "2014-01-27 23:00")),
      stop = as.POSIXct(c("2014-01-03 19:00",
                           "2014-01-13 17:00",
                           "2014-01-20 10:00",
                           "2014-01-28 20:00"))
    )
    
    
    #average value within events, grouped by class
    require(sqldf)
    results <- sqldf("
         SELECT x.id, y.class, avg(y.value) AS mean
         FROM events as x, data as y
         WHERE y.date between x.start and x.stop
         GROUP BY x.id, y.class
    ")
    

    which gives the desired output

           id class      mean
    1    blue     A 0.4879129
    2    blue     B 0.4945888
    3    blue     C 0.5312504
    4    blue     D 0.4968260
    5   green     A 0.5235671
    6   green     B 0.5030602
    7   green     C 0.5071219
    8   green     D 0.5002010
    9     red     A 0.5122966
    10    red     B 0.4767966
    11    red     C 0.5032387
    12    red     D 0.5018389
    13 yellow     A 0.4727868
    14 yellow     B 0.4626688
    15 yellow     C 0.4930207
    16 yellow     D 0.5184966
    

    However, as my real data is huge (long format can reach several million rows), the SQL operation needs quite some time.

    Are there more efficient ways to do this operation? I've stumpled across data.table::foverlaps, which is called an 'overlap join', but I don't fully understand if this is what I need.

    If there was an efficient way to add en 'event' column to the data, indicating for each row (date), which event it falls into, then I could do a grouped summary with dplyr compare to the SQL statement. But I don't know how to do this...

    Any suggestions from the experts would be appreciated.

    Update

    As suggested in the comments, I have added the creation of an index to my SQL statement. Unfortunately this did not speed up things for one of my large real world problems. The calculation still took ~40 mins to run.

    Then I copy-pasted the data.table solution, kindly offered by David, and was impressed to see that it runs in less than 1 second on exactly the same real-world dataset.

    I still don't understand how and why it does what it does, but my motivation to spend some time on learning the data.table syntax has certainly increased a lot. Thanks again for that!

    解决方案

    Here's a possible data.table::foverlaps solution

    library(data.table)
    setDT(data)[, `:=`(start = date, stop = date)]
    setkey(setDT(events), start, stop)
    foverlaps(data, events, nomatch = 0L)[, .(Mean = mean(value)), keyby = .(id, class)]
    #         id class      Mean
    #  1:   blue     A 0.4879129
    #  2:   blue     B 0.4945888
    #  3:   blue     C 0.5312504
    #  4:   blue     D 0.4968260
    #  5:  green     A 0.5235671
    #  6:  green     B 0.5030602
    #  7:  green     C 0.5071219
    #  8:  green     D 0.5002010
    #  9:    red     A 0.5122966
    # 10:    red     B 0.4767966
    # 11:    red     C 0.5032387
    # 12:    red     D 0.5018389
    # 13: yellow     A 0.4727868
    # 14: yellow     B 0.4626688
    # 15: yellow     C 0.4930207
    # 16: yellow     D 0.5184966
    


    The logic seems pretty straight forward to me.

    1. Set start and stop columns within data to overlap against.
    2. key the events data set by the same columns.
    3. Run foverlaps and remove unmatched intervals (nomatch = 0L).
    4. Calculate mean(value) by id and class

    这篇关于重叠连接以计算间隔内的平均值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

  • 08-20 19:33