Run length - panel data

Problem description

I have a well-balanced panel data set which contains NA observations. I will be using LOCF, and would like to know how many consecutive NAs there are in each panel before carrying observations forward. LOCF is a procedure whereby missing values are "filled in" using the "last observation carried forward". This can make sense in some time-series applications; perhaps we have weather data in 5-minute increments: a good guess at the value of a missing observation might be the observation made 5 minutes earlier.
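To make the LOCF idea concrete, here is a minimal sketch using `nafill()` from data.table (available in reasonably recent versions of the package). The temperature readings and the two-panel table are made-up illustration data, not part of the question's data set.

```r
library(data.table)

# Hypothetical readings taken at 5-minute intervals, with gaps
temps <- c(21.5, NA, NA, 22.1, NA)

# LOCF: each NA takes the most recent non-NA value before it
filled <- nafill(temps, type = "locf")

# Hypothetical two-panel table: filling by id keeps values from leaking
# across panels, so a leading NA in a panel stays NA
panel <- data.table(id = rep(1:2, each = 3),
                    x  = c(1, NA, NA, NA, 5, NA))
panel[, x_locf := nafill(x, type = "locf"), by = id]
```

Grouping by `id` is the important design choice here: without it, the leading NA of the second panel would be filled with the last value of the first panel.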

Obviously, it makes more sense to carry an observation forward one hour within a panel than to carry that same observation forward to the next year in the same panel.

I am aware that zoo::na.locf accepts a "maxgap" argument; however, I want to get a better feel for my data first. Here is a simple example:

library(data.table)
set.seed(12345)

### Create a "panel" data set
data <- data.table(id = rep(1:10, each = 10),
                   date = seq(as.POSIXct('2012-01-01'),
                              as.POSIXct('2012-01-10'),
                              by = '1 day'),
                   x  = runif(100))

### Randomly assign NA's to our "x" variable
na <- sample(1:100, size = 52)
data[na, x := NA]

### Calculate the max number of consecutive NA's by group...this is what I want:
### ID       Consecutive NA's
  #  1       1
  #  2       3
  #  3       3
  #  4       3
  #  5       4
  #  6       5
  #  ...
  #  10      2

### Count the total number of NA's by group...this is as far as I get:
data[is.na(x), .N, by = id]

All solutions are welcome, but data.table solutions are strongly preferred; the data file is large.

Recommended answer

This will do it:

data[, max(with(rle(is.na(x)), lengths[values])), by = id]

I just ran rle to find all runs of consecutive NAs and picked the maximum length.
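For readers unfamiliar with `rle`, the following sketch shows what the one-liner does on a single vector. The vector and the helper name `max_na_run` are hypothetical, introduced only for illustration. The helper also adds a guard that the one-liner does not have: for a group with no NAs, `max()` of an empty vector returns `-Inf` with a warning, whereas the helper returns 0.

```r
# Hypothetical vector with one run of 2 NAs and one run of 3 NAs
x <- c(0.7, NA, NA, 0.2, NA, NA, NA, 0.9)

r <- rle(is.na(x))
# r$values  marks each run as NA (TRUE) or non-NA (FALSE)
# r$lengths gives the length of each run

na_runs <- r$lengths[r$values]   # lengths of the NA runs only
longest <- max(na_runs)          # longest run of consecutive NAs

# Hypothetical helper: same idea, but returns 0 when there are no NAs
max_na_run <- function(x) {
  r <- rle(is.na(x))
  max(0L, r$lengths[r$values])
}
```

Assuming the same `data` table as in the question, the helper could be used per panel as `data[, max_na_run(x), by = id]`.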

Here's a rather convoluted answer to the follow-up question from the comments about recovering the date ranges for the above max:

data[, {
         runs = rle(is.na(x))                           # runs of NA / non-NA values
         na_lengths = runs$lengths * runs$values        # zero out the non-NA runs
         n = which.max(na_lengths)                      # index in rle of the longest NA run

         start = sum(runs$lengths[seq_len(n - 1)]) + 1  # first row of that run
         end   = sum(runs$lengths[seq_len(n)])          # last row of that run

         # max_na is 0 (and the dates are not meaningful) for a group with no NAs
         list(start_date = date[start], end_date = date[end], max_na = na_lengths[n])
       }, by = id]
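As an alternative sketch (not part of the original answer), the same date ranges can be recovered with data.table's `rleid()`, which labels each run with an integer id so that runs can be summarised like any other group. The small table `dt` and the column names `is_na`, `start`, `end`, and `n` are assumptions made for this example.

```r
library(data.table)

# Hypothetical panel: two ids, five daily observations each
dt <- data.table(
  id   = rep(1:2, each = 5),
  date = rep(as.Date("2012-01-01") + 0:4, times = 2),
  x    = c(1, NA, NA, 4, NA,
           NA, NA, NA, 2, 3)
)

# One row per run of identical is.na(x) values within each id
runs <- dt[, .(is_na = is.na(x[1]),
               start = date[1],
               end   = date[.N],
               n     = .N),
           by = .(id, run = rleid(id, is.na(x)))]

# Keep only the NA runs, then the longest one per id (first run wins ties)
longest <- runs[is_na == TRUE, .SD[which.max(n)], by = id]
```

The intermediate `runs` table also answers the "better feel for my data" part of the question, since it lists every gap with its start date, end date, and length rather than only the maximum.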

