Problem description

I have time series data that I'm predicting on, so I am creating lag variables to use in my statistical analysis. I'd like a quick way to create multiple variables given specific inputs so that I can easily cross-validate and compare models.

The following is example code that adds 2 lags for 2 different variables (4 total) given a certain category (A, B, C):

# Load dplyr
library(dplyr)

# create day, category, and 2 value vectors
days = 1:9
cats = rep(c('A','B','C'),3)
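# seed the RNG; `set.seed = 19` would only create a variable named set.seed
# (the output printed below came from an unseeded run, so exact values will differ)
set.seed(19)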
values1 = round(rnorm(9, 16, 4))
values2 = round(rnorm(9, 16, 16))

# create data frame
data = data.frame(days, cats, values1, values2)

# mutate new lag variables
LagVal = data %>% arrange(days) %>% group_by(cats) %>%
  mutate(LagVal1.1 = lag(values1, 1)) %>%
  mutate(LagVal1.2 = lag(values1, 2)) %>%
  mutate(LagVal2.1 = lag(values2, 1)) %>%
  mutate(LagVal2.2 = lag(values2, 2))

LagVal

   days   cats values1 values2 LagVal1.1 LagVal1.2 LagVal2.1 LagVal2.2
  <int> <fctr>   <dbl>   <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1     1      A      16     -10        NA        NA        NA        NA
2     2      B      14      24        NA        NA        NA        NA
3     3      C      16      -6        NA        NA        NA        NA
4     4      A      12      25        16        NA       -10        NA
5     5      B      20      14        14        NA        24        NA
6     6      C      18      -5        16        NA        -6        NA
7     7      A      21       2        12        16        25       -10
8     8      B      19       5        20        14        14        24
9     9      C      18      -3        18        16        -5        -6

My problem comes in at the # mutate new lag variables step, since I have about a dozen predictor variables that I would potentially want to lag up to 10 times (~13k row dataset), and I don't have the heart to create 120 new variables.

Here is my attempt at writing a function which mutates new variables given the inputs for data (dataset to mutate), variables (the variables you wish to lag), and lags (the number of lags per variable):

MultiMutate = function(data, variables, lags){
  # select the data to be working with
  FuncData = data
  # Loop through desired variables to mutate
  for (i in variables){
    # Loop through number of desired lags
    for (u in 1:lags){
      FuncData = FuncData %>% arrange(days) %>% group_by(cats) %>%
        # Mutate new variable for desired number of lags. Give new variable a name with the lag number appended
        mutate(paste(i, u) = lag(i, u))
    }
  }
  FuncData
}

To be honest I'm just sort of lost on how to get this to work. The ordering of my for-loops and overall logic makes sense, but the way the function takes characters into variables and the overall syntax seems way off. Is there a simple way to fix up this function to get my desired result?

In particular, I'm looking for:

  1. A function like MultiMutate(data = data, variables = c(values1, values2), lags = 2) that would create the exact result of LagVal from above.

  2. Dynamic naming of the variables based on the variable and its lag, i.e. value1.1, value1.2, value2.1, value2.2, etc.

Thank you in advance and let me know if you need additional information. If there's a simpler way to get what I'm looking for, then I am all ears.

Recommended answer

You'll have to reach deeper into the tidyverse toolbox to add them all at once. If you nest data for each value of cats, you can iterate over the nested data frames, iterating the lags over the values* columns in each.

library(tidyverse)
set.seed(47)

# tibble() replaces the deprecated data_frame()
df <- tibble(days = 1:9,
             cats = rep(c('A','B','C'),3),
             values1 = round(rnorm(9, 16, 4)),
             values2 = round(rnorm(9, 16, 16)))


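# tidyr >= 1.0 expects named arguments for nest() and explicit columns for unnest()
df %>% nest(data = -cats) %>%
    mutate(lags = map(data, function(dat) {
        # dat[-1] drops days; for each values* column, build lags 1 and 2
        imap_dfc(dat[-1], ~set_names(map(1:2, lag, x = .x),
                                     paste0(.y, '_lag', 1:2)))
        })) %>%
    unnest(c(data, lags)) %>%
    arrange(days)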
#> # A tibble: 9 x 8
#>   cats   days values1 values2 values1_lag1 values1_lag2 values2_lag1
#>   <chr> <int>   <dbl>   <dbl>        <dbl>        <dbl>        <dbl>
#> 1 A         1     24.     -7.          NA           NA           NA
#> 2 B         2     19.      1.          NA           NA           NA
#> 3 C         3     17.     17.          NA           NA           NA
#> 4 A         4     15.     24.          24.          NA           -7.
#> 5 B         5     16.    -13.          19.          NA            1.
#> 6 C         6     12.     17.          17.          NA           17.
#> 7 A         7     12.     27.          15.          24.          24.
#> 8 B         8     16.     15.          16.          19.         -13.
#> 9 C         9     15.     36.          12.          17.          17.
#> # ... with 1 more variable: values2_lag2 <dbl>
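
For the narrower question of repairing the loop itself, here is a minimal sketch, assuming dplyr >= 1.0.0. The original attempt fails for two reasons: `paste(i, u) = ...` is not a valid argument name inside mutate(), and `lag(i, u)` lags the string held in `i` rather than the column it names. The version below takes the variables as strings, looks each column up with `.data[[v]]`, and sets the new name with `!!new_name :=`. The `group` and `order_by` parameters are additions for illustration and were not in the asker's signature; the data reuse the values printed in the question.

```r
library(dplyr)

# Hypothetical repair of the question's MultiMutate().
# `variables` is a character vector; `group` and `order_by` are assumed extras.
MultiMutate <- function(data, variables, lags,
                        group = "cats", order_by = "days") {
  # Sort and group once, outside the loops
  out <- data %>%
    arrange(.data[[order_by]]) %>%
    group_by(across(all_of(group)))

  for (v in variables) {          # each variable name, as a string
    for (k in seq_len(lags)) {    # lag 1, 2, ..., lags
      new_name <- paste0(v, ".", k)
      # `!!new_name :=` names the column; .data[[v]] fetches it by string
      out <- out %>% mutate(!!new_name := dplyr::lag(.data[[v]], k))
    }
  }
  ungroup(out)
}

# Same values as the LagVal table shown in the question
df <- data.frame(
  days    = 1:9,
  cats    = rep(c("A", "B", "C"), 3),
  values1 = c(16, 14, 16, 12, 20, 18, 21, 19, 18),
  values2 = c(-10, 24, -6, 25, 14, -5, 2, 5, -3)
)

res <- MultiMutate(df, variables = c("values1", "values2"), lags = 2)
```

Note the call uses quoted names, `c("values1", "values2")`, rather than the bare `c(values1, values2)` from the question; supporting bare names would need extra tidy-evaluation handling that this sketch leaves out.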

data.table::shift makes this simpler, as it's vectorized. Naming takes more work than the actual lagging:

library(data.table)

setDT(df)

df[, sapply(1:2, function(x){paste0('values', x, '_lag', 1:2)}) := shift(.SD, 1:2),
   by = cats, .SDcols = values1:values2][]
#>    days cats values1 values2 values1_lag1 values1_lag2 values2_lag1
#> 1:    1    A      24      -7           NA           NA           NA
#> 2:    2    B      19       1           NA           NA           NA
#> 3:    3    C      17      17           NA           NA           NA
#> 4:    4    A      15      24           24           NA           -7
#> 5:    5    B      16     -13           19           NA            1
#> 6:    6    C      12      17           17           NA           17
#> 7:    7    A      12      27           15           24           24
#> 8:    8    B      16      15           16           19          -13
#> 9:    9    C      15      36           12           17           17
#>    values2_lag2
#> 1:           NA
#> 2:           NA
#> 3:           NA
#> 4:           NA
#> 5:           NA
#> 6:           NA
#> 7:           -7
#> 8:            1
#> 9:           17
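
With a dozen variables and up to 10 lags, the names are easier to build from vectors than from a hard-coded `sapply`. The following is a sketch of that idea, not part of the original answer: `vars`, `lags`, and `new_names` are illustrative names, and the data reuse the values printed in the question. It relies on `shift()` returning its results grouped by column first and then by lag, which is the same ordering the code above depends on.

```r
library(data.table)

# Same values as the LagVal table shown in the question
dt <- data.table(
  days    = 1:9,
  cats    = rep(c("A", "B", "C"), 3),
  values1 = c(16, 14, 16, 12, 20, 18, 21, 19, 18),
  values2 = c(-10, 24, -6, 25, 14, -5, 2, 5, -3)
)

vars <- c("values1", "values2")  # columns to lag (assumed names)
lags <- 1:2                      # lags to create for each column

# Build names in the order shift() returns its results:
# every lag of the first column, then every lag of the second, and so on
new_names <- paste0(rep(vars, each = length(lags)), "_lag", lags)

setorder(dt, days)               # lag in time order within each category
dt[, (new_names) := shift(.SD, lags), by = cats, .SDcols = vars]
```

Changing `vars` or `lags` is then the only edit needed to produce a different set of lagged columns.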
