Moving average by unique date


Question

I have a dataset that may contain multiple observations per date. So there could be 5 observations on date1, 2 observations on date2, and 1 observation on date3.

I want to calculate the moving average by date and, importantly, without summarising or reducing the number of rows. That is, in the example above I would still have 8 rows of data, and in a column next to them I would have that date's rolling average price. I find this challenging because a typical rolling function from the zoo package goes line by line, and I don't know how to get it to step by Date instead.

For example, the first step would typically be:

DF %>%
  group_by(Date) %>%
  summarise(mean_daily_price = mean(Price)) %>%
  ungroup() %>%
  arrange(Date) %>%
  mutate(ra = rollapply(mean_daily_price, 2, mean, partial = TRUE))

But the summarise makes me lose rows.

library(dplyr)
library(zoo)

DF = structure(list(
  Date = c("Jan-13", "Jan-13", "Jan-13", "Jan-13", "Jan-13", "Jul-14", "Jul-14", "Oct-16"),
  Price = c(100L, 200L, 300L, 1000L, 400L, 150L, 50L, 600L),
  Average.by.Date = c(400L, 400L, 400L, 400L, 400L, 100L, 100L, 600L),
  Moving_Average_Size_2 = c(NA, NA, NA, NA, NA, 250L, 250L, 350L)),
  .Names = c("Date", "Price", "Average.by.Date", "Moving_Average_Size_2"),
  class = "data.frame", row.names = c(NA, -8L))


Recommended answer

I think your safest approach is a two-step process: calculate the rolling averages by Date, then merge them back in (still using dplyr here).

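# Step 1: one row per Date with its mean price, then a width-2 rolling mean.
# Date is a character column, so arrange() sorts it alphabetically; for this
# sample data that happens to match chronological order.
rolledAvg <-
  DF %>%
  group_by(Date) %>%
  summarise(mean_daily_price = mean(Price)) %>%
  ungroup() %>%
  arrange(Date) %>%
  mutate(ra = rollapply(mean_daily_price, 2, mean, partial = TRUE, fill = NA))

# Step 2: join the per-date results back onto the original rows.
left_join(DF, rolledAvg, by = "Date")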

This gives:

    Date Price Average.by.Date Moving_Average_Size_2 mean_daily_price  ra
1 Jan-13   100             400                    NA              400 250
2 Jan-13   200             400                    NA              400 250
3 Jan-13   300             400                    NA              400 250
4 Jan-13  1000             400                    NA              400 250
5 Jan-13   400             400                    NA              400 250
6 Jul-14   150             100                   250              100 350
7 Jul-14    50             100                   250              100 350
8 Oct-16   600             600                   350              600 600

I see in a comment on another answer that you don't think the first ra should be 250. If that is the case, change the calculation in your call to rollapply. Right now, it appears to be behaving as expected and documented. So, if you want something different, you will need to explain the change you want (probably as a separate question).
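One possible change, sketched here as an assumption about what the asker wants: the Moving_Average_Size_2 column in the sample data (NA, 250, 350 per date) looks like a right-aligned window, where each average covers the current and the previous date. zoo's rollapplyr() does this. The vector below simply reuses the three daily means from the output table.

```r
library(zoo)

# Daily means taken from the output table above, in date order.
mean_daily_price <- c(400, 100, 600)

# rollapplyr() is rollapply() with align = "right": each window ends at the
# current date. fill = NA pads the first position, where no full window exists.
ra_right <- rollapplyr(mean_daily_price, 2, mean, fill = NA)
ra_right
```

Using this call inside the mutate() step would give NA for Jan-13, 250 for Jul-14, and 350 for Oct-16 after the join.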

Another caveat, particularly important for any other readers who stumble on this: the current approach treats sequential date entries as equidistant, no matter how far apart they actually are. If this works for your needs, great. But in many use cases you may need to account for the actual amount of time between measurements.
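A minimal sketch of a time-aware alternative, in base R only. It assumes the "Mon-YY" strings can be read as the first day of each month, and it uses a hypothetical 730-day lookback window (`window_days` is not from the original question). Each date's average then covers only the dates that fall inside that window.

```r
# Per-date means from the example; Date strings are "Mon-YY".
daily <- data.frame(
  Date = c("Jan-13", "Jul-14", "Oct-16"),
  mean_daily_price = c(400, 100, 600)
)

# Parse to real dates (assumed: first of the month). Note that %b relies on
# the locale's month abbreviations, so this assumes an English or C locale.
daily$day <- as.Date(paste0("01-", daily$Date), format = "%d-%b-%y")
daily <- daily[order(daily$day), ]

window_days <- 730  # hypothetical lookback window, in days

# For each date, average the daily means observed within the window
# ending at (and including) that date.
daily$ra_time <- sapply(seq_len(nrow(daily)), function(i) {
  age <- as.numeric(daily$day[i] - daily$day)  # days before the current date
  mean(daily$mean_daily_price[age >= 0 & age < window_days])
})
daily
```

With this window, Jul-14 averages with Jan-13 (546 days earlier), while Oct-16 stands alone because Jul-14 is 823 days behind it.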

Similarly, the current approach loses all information about the number of measurements taken on each date. It may be worthwhile to consider an approach that weights by the number of observations (unless you are happy trusting each day's calculated average).
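One way to weight by observation count, offered as a sketch rather than the answer's own method: roll the per-date price totals and the per-date counts separately, then divide. The column names `total`, `n`, and `ra_weighted` are made up for illustration, and the window is right-aligned with partial windows allowed.

```r
library(dplyr)
library(zoo)

DF <- data.frame(
  Date  = c("Jan-13", "Jan-13", "Jan-13", "Jan-13", "Jan-13",
            "Jul-14", "Jul-14", "Oct-16"),
  Price = c(100, 200, 300, 1000, 400, 150, 50, 600)
)

weighted <- DF %>%
  group_by(Date) %>%
  summarise(total = sum(Price), n = n()) %>%
  # Sort chronologically (assumes English month abbreviations in the locale).
  arrange(as.Date(paste0("01-", Date), format = "%d-%b-%y")) %>%
  # Pooled mean over the current and previous date:
  # sum of prices in the window divided by number of observations in it.
  mutate(ra_weighted = rollapplyr(total, 2, sum, partial = TRUE) /
                       rollapplyr(n, 2, sum, partial = TRUE))

left_join(DF, weighted, by = "Date")
```

Here the Jul-14 value is 2200 / 7 (about 314.3) instead of the unweighted 250, because the five Jan-13 observations count for more than the two Jul-14 ones.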


08-23 16:10