问题描述
我经常需要在给定时间间隔(事件)内平均时间序列数据,基本上是。
如答案所示,我使用一个SQL语句对我的数据'长'格式。这里是一个例子:
c> 事件 foverlaps 并删除不匹配的时间间隔( nomatch = 0L )。
I frequently have to average time series data within given time intervals ('events'), basically as has been asked here.
As suggested in the answers, I use an SQL statement on my data in 'long' format. Here is an example:
#create dummy data frame set.seed(1) data <- data.frame( date = seq(from = as.POSIXct("2014-01-01 00:00"), to = as.POSIXct("2014-01-31 23:00"), by = 300), A = runif(8917), B = runif(8917), C = runif(8917), D = runif(8917) ) #convert to long format require(dplyr) data <- data %>% gather(class,value,A:D) # create dummy events events <- data.frame( id = c("blue","red","green","yellow"), start = as.POSIXct(c("2014-01-03 13:00", "2014-01-12 08:00", "2014-01-18 10:00", "2014-01-27 23:00")), stop = as.POSIXct(c("2014-01-03 19:00", "2014-01-13 17:00", "2014-01-20 10:00", "2014-01-28 20:00")) ) #average value within events, grouped by class require(sqldf) results <- sqldf(" SELECT x.id, y.class, avg(y.value) AS mean FROM events as x, data as y WHERE y.date between x.start and x.stop GROUP BY x.id, y.class ")
which gives the desired output
id class mean 1 blue A 0.4879129 2 blue B 0.4945888 3 blue C 0.5312504 4 blue D 0.4968260 5 green A 0.5235671 6 green B 0.5030602 7 green C 0.5071219 8 green D 0.5002010 9 red A 0.5122966 10 red B 0.4767966 11 red C 0.5032387 12 red D 0.5018389 13 yellow A 0.4727868 14 yellow B 0.4626688 15 yellow C 0.4930207 16 yellow D 0.5184966
However, as my real data is huge (long format can reach several million rows), the SQL operation needs quite some time.
Are there more efficient ways to do this operation? I've stumpled across data.table::foverlaps, which is called an 'overlap join', but I don't fully understand if this is what I need.
If there was an efficient way to add en 'event' column to the data, indicating for each row (date), which event it falls into, then I could do a grouped summary with dplyr compare to the SQL statement. But I don't know how to do this...
Any suggestions from the experts would be appreciated.
Update
As suggested in the comments, I have added the creation of an index to my SQL statement. Unfortunately this did not speed up things for one of my large real world problems. The calculation still took ~40 mins to run.
Then I copy-pasted the data.table solution, kindly offered by David, and was impressed to see that it runs in less than 1 second on exactly the same real-world dataset.
I still don't understand how and why it does what it does, but my motivation to spend some time on learning the data.table syntax has certainly increased a lot. Thanks again for that!
Here's a possible data.table::foverlaps solution
library(data.table) setDT(data)[, `:=`(start = date, stop = date)] setkey(setDT(events), start, stop) foverlaps(data, events, nomatch = 0L)[, .(Mean = mean(value)), keyby = .(id, class)] # id class Mean # 1: blue A 0.4879129 # 2: blue B 0.4945888 # 3: blue C 0.5312504 # 4: blue D 0.4968260 # 5: green A 0.5235671 # 6: green B 0.5030602 # 7: green C 0.5071219 # 8: green D 0.5002010 # 9: red A 0.5122966 # 10: red B 0.4767966 # 11: red C 0.5032387 # 12: red D 0.5018389 # 13: yellow A 0.4727868 # 14: yellow B 0.4626688 # 15: yellow C 0.4930207 # 16: yellow D 0.5184966
The logic seems pretty straight forward to me.
- Set start and stop columns within data to overlap against.
- key the events data set by the same columns.
- Run foverlaps and remove unmatched intervals (nomatch = 0L).
- Calculate mean(value) by id and class
这篇关于重叠连接以计算间隔内的平均值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!