Grouping by every n minutes with dplyr

Problem description

I have a dataset containing 10 events occurring at certain times on a given day, with a corresponding value for each event:

d1 <- data.frame(date = as.POSIXct(c("21/05/2010 19:59:37", "21/05/2010 08:40:30",
                            "21/05/2010 09:21:00", "21/05/2010 22:29:50", "21/05/2010 11:27:34",
                            "21/05/2010 18:25:14", "21/05/2010 15:16:01", "21/05/2010 09:41:53",
                            "21/05/2010 15:01:29", "21/05/2010 09:02:06"), format ="%d/%m/%Y %H:%M:%S"),
                 value = c(11313,42423,64645,643426,1313313,1313,3535,6476,11313,9875))

I want to aggregate the results every 3 minutes, in a standard dataframe format (from "21/05/2010 00:00:00" to "21/05/2010 23:57:00", so that the dataframe has 480 bins of 3 minutes each).

First, I create a dataframe containing bins of 3 minutes each:

d2 <- data.frame(date = seq(as.POSIXct("2010-05-21 00:00:00"),
                            by="3 min", length.out=(1440/3)))

Then, I merge the two dataframes together and replace the resulting NAs with 0:

library(dplyr)
m <- merge(d1, d2, all=TRUE) %>% mutate(value = ifelse(is.na(value),0,value))

Finally, I use period.apply() from the xts package to sum the values for each bin:

library(xts)
a <- period.apply(m$value, endpoints(m$date, "minutes", 3), sum)

Is there a more efficient way to do this? It does not feel optimal.

Update #1

I adjusted my code after Joshua's answer:

library(xts)
startpoints <- function (x, on = "months", k = 1) {
  head(endpoints(x, on, k) + 1, -1)
}

m <- seq(as.POSIXct("2010-05-21 00:00:00"), by="3 min", length.out=1440/3)
x <- merge(value=xts(d1$value, d1$date), xts(,m))
y <- period.apply(x, c(0,startpoints(x, "minutes", 3)), sum, na.rm=TRUE)

I wasn't aware that na.rm=TRUE could be used with period.apply(), which now allows me to skip mutate(value = ifelse(is.na(value),0,value)). It's a step forward and I'm actually pleased with the xts approach here but I would like to know if there is a pure dplyr solution I could use in such a situation.

Update #2

After trying Khashaa's answer, I got wrong results because my time zone was not specified. This is what I had:

> tail(d4)
               interval sumvalue
476 2010-05-21 23:45:00       NA
477 2010-05-21 23:48:00       NA
478 2010-05-21 23:51:00       NA
479 2010-05-21 23:54:00       NA
480 2010-05-21 23:57:00    11313
481 2010-05-22 02:27:00   643426
> d4[450,]
               interval sumvalue
450 2010-05-21 22:27:00       NA

Now, after Sys.setenv(TZ="UTC"), it all works fine.
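
A likely explanation for the shifted rows is that as.POSIXct() parses strings in the session's local time zone, while ymd_hms() defaults to UTC, so the events and the bins end up on different clocks. As a minimal sketch (not part of the original post, base R only), passing tz = "UTC" explicitly on both sides should give the same alignment without changing the session-wide setting. The names `events`, `bins` and `event_bin` are illustrative only.

```r
# Parse two of the question's timestamps in an explicit time zone
# instead of relying on the session default.
events <- as.POSIXct(c("21/05/2010 22:29:50", "21/05/2010 19:59:37"),
                     format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

# Build the 3-minute grid in the same time zone.
bins <- seq(as.POSIXct("2010-05-21 00:00:00", tz = "UTC"),
            by = "3 min", length.out = 1440 / 3)

# Floor each event to the start of its 3-minute bin (180 seconds).
event_bin <- as.POSIXct(floor(as.numeric(events) / 180) * 180,
                        origin = "1970-01-01", tz = "UTC")

# Position of each floored event in the grid; compare the underlying
# seconds so the lookup does not depend on how date-times are printed.
bin_index <- match(as.numeric(event_bin), as.numeric(bins))
bin_index
```

With both sides in UTC, the 22:29:50 event falls in the bin starting at 22:27:00, which is row 450 of the grid, as in the answer's output.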

Solution

A lubridate/dplyr-style solution:

library(lubridate)
library(dplyr)
d2 <- data.frame(interval = seq(ymd_hms('2010-05-21 00:00:00'), by = '3 min',length.out=(1440/3)))
d3 <- d1 %>%
  # floor each timestamp to the start of its 3-minute bin
  mutate(interval = floor_date(date, unit="hour")+minutes(floor(minute(date)/3)*3)) %>%
  group_by(interval) %>%
  # summarise() returns one row per bin; mutate() would repeat the sum on every event row
  summarise(sumvalue=sum(value))
d4 <- merge(d2,d3, all=TRUE) # better if left_join is used
tail(d4)
#               interval sumvalue
#475 2010-05-21 23:42:00       NA
#476 2010-05-21 23:45:00       NA
#477 2010-05-21 23:48:00       NA
#478 2010-05-21 23:51:00       NA
#479 2010-05-21 23:54:00       NA
#480 2010-05-21 23:57:00       NA
d4[450,]
#               interval sumvalue
#450 2010-05-21 22:27:00   643426
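
The flooring expression above can probably be shortened: reasonably recent lubridate releases accept a multi-unit string such as "3 minutes" in floor_date(). The sketch below assumes such a version is installed; the names `event` and `bin_start` are illustrative and do not come from the original answer.

```r
library(lubridate)

# One timestamp from the question, parsed as UTC (the ymd_hms() default).
event <- ymd_hms("2010-05-21 22:29:50")

# Floor directly to a multiple of 3 minutes instead of combining
# floor_date(unit = "hour") with minutes(floor(minute(date) / 3) * 3).
bin_start <- floor_date(event, unit = "3 minutes")
bin_start
```

Since 60 is divisible by 3, the bins line up with the start of each hour, which is the same grid the answer builds by hand.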

If you are comfortable working with Date (I am not), you can dispense with lubridate, and replace the final merge with left_join.
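
One way that suggestion might look, as a hedged sketch rather than the answerer's own code: floor the timestamps with plain arithmetic on the underlying seconds, aggregate with summarise(), and attach the sums to the grid with left_join(). The explicit tz = "UTC", the `bin_secs` helper and the final coalesce() step (which turns empty bins into 0 instead of NA) are assumptions added here for illustration.

```r
library(dplyr)

bin_secs <- 3 * 60  # bin width in seconds (illustrative helper)

# The question's data, parsed in an explicit time zone.
d1 <- data.frame(
  date = as.POSIXct(c("21/05/2010 19:59:37", "21/05/2010 08:40:30",
                      "21/05/2010 09:21:00", "21/05/2010 22:29:50",
                      "21/05/2010 11:27:34", "21/05/2010 18:25:14",
                      "21/05/2010 15:16:01", "21/05/2010 09:41:53",
                      "21/05/2010 15:01:29", "21/05/2010 09:02:06"),
                    format = "%d/%m/%Y %H:%M:%S", tz = "UTC"),
  value = c(11313, 42423, 64645, 643426, 1313313, 1313, 3535, 6476, 11313, 9875)
)

# The full grid of 480 three-minute bins for the day.
d2 <- data.frame(interval = seq(as.POSIXct("2010-05-21 00:00:00", tz = "UTC"),
                                by = "3 min", length.out = 1440 / 3))

# Floor each event to its bin start and sum the values per bin.
d3 <- d1 %>%
  mutate(interval = as.POSIXct(floor(as.numeric(date) / bin_secs) * bin_secs,
                               origin = "1970-01-01", tz = "UTC")) %>%
  group_by(interval) %>%
  summarise(sumvalue = sum(value))

# Keep every bin from the grid; bins without events get 0.
d4 <- d2 %>%
  left_join(d3, by = "interval") %>%
  mutate(sumvalue = coalesce(sumvalue, 0))

d4[450, ]
```

Because left_join() keeps every row of d2, the result always has one row per bin, and no stray rows can appear outside the grid the way they did with merge(all = TRUE) in Update #2.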
