Problem description
I have a dataset containing 10 events occurring at certain times on a given day, with a corresponding value for each event:
d1 <- data.frame(date = as.POSIXct(c("21/05/2010 19:59:37", "21/05/2010 08:40:30",
"21/05/2010 09:21:00", "21/05/2010 22:29:50", "21/05/2010 11:27:34",
"21/05/2010 18:25:14", "21/05/2010 15:16:01", "21/05/2010 09:41:53",
"21/05/2010 15:01:29", "21/05/2010 09:02:06"), format ="%d/%m/%Y %H:%M:%S"),
value = c(11313,42423,64645,643426,1313313,1313,3535,6476,11313,9875))
I want to sum the values in 3-minute bins and get the result as a standard data frame covering the whole day (from "21/05/2010 00:00:00" to "21/05/2010 23:57:00", so that the data frame has 480 bins of 3 minutes each).
First, I create a dataframe containing bins of 3 minutes each:
d2 <- data.frame(date = seq(as.POSIXct("2010-05-21 00:00:00"),
by="3 min", length.out=(1440/3)))
Then, I merge the two data frames together and replace the resulting NAs with 0:
library(dplyr)
m <- merge(d1, d2, all=TRUE) %>% mutate(value = ifelse(is.na(value),0,value))
Finally, I use period.apply() from the xts package to sum the values for each bin:
library(xts)
a <- period.apply(m$value, endpoints(m$date, "minutes", 3), sum)
Is there a more efficient way to do this? It does not feel optimal.
Update #1
I adjusted my code after Joshua's answer:
library(xts)
startpoints <- function (x, on = "months", k = 1) {
head(endpoints(x, on, k) + 1, -1)
}
m <- seq(as.POSIXct("2010-05-21 00:00:00"), by="3 min", length.out=1440/3)
x <- merge(value=xts(d1$value, d1$date), xts(,m))
y <- period.apply(x, c(0,startpoints(x, "minutes", 3)), sum, na.rm=TRUE)
I wasn't aware that na.rm=TRUE could be used with period.apply(), which now allows me to skip mutate(value = ifelse(is.na(value),0,value)). It's a step forward and I'm actually pleased with the xts approach here, but I would like to know if there is a pure dplyr solution I could use in such a situation.
Update #2
After trying Khashaa's answer, I got incorrect results because my time zone was not specified. This is what I had:
> tail(d4)
interval sumvalue
476 2010-05-21 23:45:00 NA
477 2010-05-21 23:48:00 NA
478 2010-05-21 23:51:00 NA
479 2010-05-21 23:54:00 NA
480 2010-05-21 23:57:00 11313
481 2010-05-22 02:27:00 643426
> d4[450,]
interval sumvalue
450 2010-05-21 22:27:00 NA
Now, after Sys.setenv(TZ="UTC"), it all works fine.
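A plausible reading of the output above is that the two date columns were built in different time zones: ymd_hms() defaults to UTC, while as.POSIXct() falls back to the session time zone when tz is not given. The base R sketch below illustrates the effect and an alternative to changing the TZ environment variable, namely passing tz explicitly. The zone "America/New_York" is only a hypothetical stand-in for a non-UTC session; it is not taken from the question.

```r
# Same wall-clock string parsed in two zones gives two different instants.
fmt <- "%d/%m/%Y %H:%M:%S"
utc_event   <- as.POSIXct("21/05/2010 22:29:50", format = fmt, tz = "UTC")
# "America/New_York" is a hypothetical example of a non-UTC local zone
local_event <- as.POSIXct("21/05/2010 22:29:50", format = fmt,
                          tz = "America/New_York")

# Offset between the two instants, in hours
offset_hours <- as.numeric(difftime(local_event, utc_event, units = "hours"))

# Building the 3-minute grid with an explicit tz keeps it aligned with
# events that were parsed with the same tz.
grid <- seq(as.POSIXct("2010-05-21 00:00:00", tz = "UTC"),
            by = "3 min", length.out = 1440 / 3)

print(offset_hours)
print(format(range(grid), "%Y-%m-%d %H:%M:%S"))
```

When both the events and the grid carry the same explicit tz, the result no longer depends on the machine's settings.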
A lubridate-dplyr-esque solution:
library(lubridate)
library(dplyr)
d2 <- data.frame(interval = seq(ymd_hms('2010-05-21 00:00:00'), by = '3 min',length.out=(1440/3)))
d3 <- d1 %>%
mutate(interval = floor_date(date, unit="hour")+minutes(floor(minute(date)/3)*3)) %>%
group_by(interval) %>%
mutate(sumvalue=sum(value)) %>%
select(interval,sumvalue)
d4 <- merge(d2,d3, all=TRUE) # better if left_join is used
tail(d4)
# interval sumvalue
#475 2010-05-21 23:42:00 NA
#476 2010-05-21 23:45:00 NA
#477 2010-05-21 23:48:00 NA
#478 2010-05-21 23:51:00 NA
#479 2010-05-21 23:54:00 NA
#480 2010-05-21 23:57:00 NA
d4[450,]
# interval sumvalue
#450 2010-05-21 22:27:00 643426
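To see why the event at 22:29:50 ends up in the 22:27:00 bin, the flooring step can be worked through by hand. This is a base R sketch of the same arithmetic as floor_date(date, unit="hour") + minutes(floor(minute(date)/3)*3); the variable names are illustrative and UTC is assumed.

```r
# One event from the question; UTC is an assumption for reproducibility
event <- as.POSIXct("2010-05-21 22:29:50", tz = "UTC")

# Minute within the hour (29), rounded down to a multiple of 3 (27)
minute_of_hour <- as.POSIXlt(event)$min
bin_minute <- floor(minute_of_hour / 3) * 3

# Start of the hour plus the floored minutes, expressed in seconds
bin_start <- as.POSIXct(trunc(event, units = "hours")) + bin_minute * 60

print(format(bin_start, "%Y-%m-%d %H:%M:%S"))
```

The seconds are dropped by truncating to the hour, so only the floored minute count determines the bin.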
If you are comfortable working with Date (I am not), you can dispense with lubridate, and replace the final merge with left_join.
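A minimal sketch of that lubridate-free variant might look as follows. It is one possible reading of the remark, not the answerer's own code: it assumes UTC throughout, floors each timestamp by integer arithmetic on seconds since the epoch, uses summarise() rather than mutate() so that each bin appears once even if several events share it, and joins with left_join(). The name bin_secs is introduced here for illustration.

```r
library(dplyr)

# Data from the question, parsed with an explicit time zone (assumed UTC)
d1 <- data.frame(
  date = as.POSIXct(c("21/05/2010 19:59:37", "21/05/2010 08:40:30",
                      "21/05/2010 09:21:00", "21/05/2010 22:29:50",
                      "21/05/2010 11:27:34", "21/05/2010 18:25:14",
                      "21/05/2010 15:16:01", "21/05/2010 09:41:53",
                      "21/05/2010 15:01:29", "21/05/2010 09:02:06"),
                    format = "%d/%m/%Y %H:%M:%S", tz = "UTC"),
  value = c(11313, 42423, 64645, 643426, 1313313, 1313, 3535, 6476, 11313, 9875)
)

bin_secs <- 3 * 60  # bin width in seconds (illustrative name)

# Full grid of 480 bins for the day
d2 <- data.frame(
  interval = seq(as.POSIXct("2010-05-21 00:00:00", tz = "UTC"),
                 by = "3 min", length.out = 1440 / 3)
)

# Floor each timestamp to the start of its bin, then sum per bin
d3 <- d1 %>%
  mutate(interval = as.POSIXct(floor(as.numeric(date) / bin_secs) * bin_secs,
                               origin = "1970-01-01", tz = "UTC")) %>%
  group_by(interval) %>%
  summarise(sumvalue = sum(value))

# Keep every bin of the grid; bins without events get NA
d4 <- left_join(d2, d3, by = "interval")

print(d4[450, ])
```

Because the grid is the left table, d4 always has exactly one row per bin, and empty bins can be turned into zeros afterwards if NA is not wanted.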