我有一个大的数据框(超过3M行)。我正在尝试计算某个ActivityType在21天窗口中出现的次数。我已经从Rolling Sum by Another Variable in R建模了我的解决方案。但是,仅对于一个ActivityType,它就需要花费很长时间。我不认为3M +行会花费过多时间。以下是我尝试的方法:

dt <- read.table(text='

                         Name      ActivityType     ActivityDate
                         John       Email            1/1/2014
                         John       Email            1/3/2014
                         John       Webinar          1/5/2014
                         John       Webinar          1/20/2014
                         John       Webinar          3/25/2014
                         John       Email            4/1/2014
                         John       Email            4/20/2014
                         Tom        Email            1/1/2014
                         Tom       Webinar           1/5/2014
                         Tom       Webinar           1/20/2014
                         Tom       Webinar           3/25/2014
                         Tom       Email             4/1/2014
                         Tom       Email             4/20/2014

                         ', header=T, row.names = NULL)

        library(data.table)
        library(reshape2)
        dt$ActivityType <- factor(dt$ActivityType)
        dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y")
        dt <- dt[order(dt$Name, dt$ActivityDate),]

   dt <- dcast(dt, Name + ActivityDate ~ ActivityType, fun.aggregate=length)
   setDT(dt)
   #Build reference table
        Ref <- dt[,list(Compare_Value=list(I(Email)),Compare_Date=list(I(ActivityDate))), by=c("Name")]
    #Use mapply to get last 21 days of value by Name
    dt[,Email_RollingSum := mapply(ActivityDate=ActivityDate,Name=Name, function(ActivityDate, Name) {
            d <- as.numeric(Ref$Compare_Date[[Name]] - ActivityDate)
            sum((d <= 0 & d >= -21)*Ref$Compare_Value[[Name]])})]

这仅适用于ActivityType = Email,然后我必须对其他ActivityType级别执行相同的操作。我从中获得解决方案的链接谈到了使用“mcapply”而不是“mapply”。请让我知道如何使用mcapply或任何其他可使它更快的解决方案。

以下是预期的输出。对于每一行,我都选择ActivityDate和之前的21天,那21天是我的时间窗口。我计算了ActivityType =“Email”在该时间窗口中出现的所有时间。
              Name      ActivityType     ActivityDate  Email_RollingSum
                 John       Email            1/1/2014         1
                 John       Email            1/3/2014         2
                 John       Webinar          1/5/2014         2
                 John       Webinar          1/20/2014        2
                 John       Webinar          3/25/2014        0
                 John       Email            4/1/2014         1
                 John       Email            4/20/2014        2
                 Tom        Email            1/1/2014         1
                 Tom       Webinar           1/5/2014         1
                 Tom       Webinar           1/20/2014        1
                 Tom       Webinar           3/25/2014        0
                 Tom       Email             4/1/2014         1
                 Tom       Email             4/20/2014        2

最佳答案

尝试一种将数据表用于名称和日期列表以及电子邮件数量来源的方法。这是通过在data.table中使用DTi参数中的DTby = .EACHI来完成的。代码如下所示:

library(data.table)
# convert character dates to Date types
dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y")
# convert to a 'data.table' and define key
setDT(dt, key = "Name")
# count emails and webinars
dt <- dt[dt[,.(Name, type = ActivityType, date = ActivityDate)],
         .(type, date,
           Email = sum(ActivityType == "Email" & between(ActivityDate, date-21, date)),
           Webinar = sum(ActivityType == "Webinar" & between(ActivityDate, date-21, date))),
         by=.EACHI]

以下内容使用与上述相同的方法,但包括一些更改,根据您的数据,这些更改可能会使速度提高30-40%。
  setDT(dt, key = "Name")
  dt[, ":="(ActivityDate = as.Date(dt$ActivityDate, "%m/%d/%Y"),
            ActivityType = as.character(ActivityType) )]
  dt4 <- dt[.(Name=Name,  type=ActivityType, date=ActivityDate), {z=between(ActivityDate, date-21, date);
                                                                  .( type, date,
                                                                     Email=sum( (ActivityType %chin% "Email") & z),
                                                                     Webinar=sum( (ActivityType %chin% "Webinar") & z) ) }
            , by=.EACHI]

关于r - 对ActivityType进行21天滚动总和的最快方法,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/34455873/

10-12 17:16