具有缺失值的时间序列的STL分解以进行异常检测

本文介绍了具有缺失值的时间序列的STL分解以进行异常检测的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试在缺少某些观测值的情况下，在气候数据的时间序列中检测异常值.在网上搜索，我发现了许多可用的方法.从消除趋势和季节成分并研究其余部分的意义上讲，其中的stl分解似乎很吸引人.阅读 STL:基于黄土的季节性趋势分解程序，stl似乎很灵活在确定用于分配可变性的设置时，不受异常值的影响，并且即使缺少值也可以应用.但是，尝试使用R，经过四年的观察并根据 http://stat.ethz.ch/R-manual/R-patched/library/stats/html/stl.html ，我遇到了错误:

I am trying to detect anomalous values in a time series of climatic data with some missing observations. Searching the web I found many available approaches. Of those, stl decomposition seems appealing, in the sense of removing trend and seasonal components and studying the remainder. Reading STL: A Seasonal-Trend Decomposition Procedure Based on Loess, stl appears to be flexible in determining the settings for assigning variability, unaffected by outliers and possible to apply despite missing values. However, trying to apply it in R, with four years of observations and defining all the parameters according to http://stat.ethz.ch/R-manual/R-patched/library/stats/html/stl.html , I encounter error:

na.action = na.omit和

当na.action = na.exclude时.

我仔细检查了频率是否正确定义.我在博客中看到了相关的问题，但是没有找到任何可以解决此问题的建议.不能在缺少值的系列中应用stl吗?我非常不愿意对它们进行插值，因为我不想引入(并因此检测...)工件.出于同样的原因，我不知道改为使用ARIMA方法是多么明智(如果缺少值仍然是个问题).

I have double checked that the frequency is correctly defined. I have seen relevant questions in blogs, but didn't find any suggestion that could solve this. Is it not possible to apply stl in a series with missing values? I am very reluctant to interpolate them, as I do not want to be introducing (and consequently detecting...) artifacts. For the same reason, I do not know how advisable it would be to use ARIMA approaches instead (and if missing values would still be a problem).

如果您知道在缺少值的系列中应用stl的方法，或者您认为我的选择在方法上不合理，或者您有任何更好的建议，请分享.我是该领域的新手，但堆满了(貌似...)相关信息.

Please share if you know a way to apply stl in a series with missing values, or if you believe my choices are methodologically not sound, or if you have any better suggestion. I am quite new in the field and overwhelmed by the heaps of (seemingly...) relevant information.

推荐答案

在stl的开头，我们找到

x <- na.action(as.ts(x))

不久之后

period <- frequency(x)
if (period < 2 || n <= 2 * period) 
    stop("series is not periodic or has less than two periods")

也就是说，stl期望x是ts之后的ts对象(否则是period == 1).让我们首先检查na.omit和na.exclude.

That is, stl expects x to be ts object after na.action(as.ts(x)) (otherwise period == 1). Let us check na.omit and na.exclude first.

很明显，在getAnywhere("na.omit.ts")的末尾找到

if (any(is.na(object))) 
    stop("time series contains internal NAs")

很简单，什么也做不了(na.omit并不从ts对象中排除NAs).现在getAnywhere("na.exclude.default")排除了NA观察值，但返回了一个exclude类的对象:

which is straightforward and nothing can be done (na.omit does not exclude NAs from ts objects). Now getAnywhere("na.exclude.default") excludes NA observations, but returns an object of class exclude:

    attr(omit, "class") <- "exclude"

，这是另一种情况.如上所述，stl期望na.action(as.ts(x))为ts，但是na.exclude(as.ts(x))属于类exclude.

and this is a different situation. As mentioned above, stl expects na.action(as.ts(x)) to be ts, but na.exclude(as.ts(x)) is of class exclude.

因此，如果人们对NAs排除感到满意，那么例如

Hence if one is satisfied with NAs exclusion then e.g.

nottem[3] <- NA
frequency(nottem)
# [1] 12
na.new <- function(x) ts(na.exclude(x), frequency = 12)
stl(nottem, na.action = na.new, s.window = "per")

有效.通常，stl不适用于NA值(即，使用na.action = na.pass)，它在Fortran中崩溃更深(请参见完整的源代码此处):

works. In general, stl does not work with NA values (i.e. with na.action = na.pass), it crashes deeper in Fortran (see full source code here):

z <- .Fortran(C_stl, ...

na.new的替代方法并不令人愉快:

Alternatives to na.new are not delightful:

na.contaguous-查找时间序列对象中最长的连续非缺失值.
na.approx，zoo中的na.locf或其他一些插值函数.
不确定这一点，但可以找到另一种适用于Python的Fortran实现此处.如果进行了一些修改，则可以使用可能从源代码安装R的Python，以防该模块确实允许缺少值.

na.contaguous - finds the longest consecutive stretch of non-missing values in a time series object.
na.approx, na.locf from zoo or some other interpolation function.
Not sure about this one, but another one Fortran implementation can be found for Python here. One could use Python of possibly install R from source after some modifications, in case this module really allows missing values.

我们在纸中可以看到对于缺失值(例如在一开始就对其进行近似)，没有简单的过程可以在调用stl之前应用于时间序列.因此，请考虑以下事实:原始实现我会考虑很长的时间，而不是考虑全新的实现.

As we can see in the paper, there is no some simple procedure for missing values (like approximating them in the very beginning) which could be applied to the time series before calling stl. So considering the fact that original implementation is quite lengthy I would think about some other alternatives than whole new implementation.

更新:从zoo中选择na.approx可能是na.approx的最佳选择，因此让我们检查其性能，即将stl的结果与完整的结果进行比较NAs使用na.approx具有一定数量的NAs时的数据集和结果.我正在使用 MAPE 作为准确性的度量标准，但仅用于趋势，因为季节性因素和余数相交零，这会扭曲结果. NAs的位置是随机选择的.

Update: a quite optimal in many aspects choice when having NAs could be na.approx from zoo, so let us check its performance, i.e. compare results of stl with full data set and results when having some number of NAs, using na.approx. I am using MAPE as a measure of accuracy, but only for trend, because seasonal component and remainder crosses zero and it would distort the result. Positions for NAs are chosen at random.

library(zoo)
library(plyr)
library(reshape)
library(ggplot2)
mape <- function(f, x) colMeans(abs(1 - f / x) * 100)

stlCheck <- function(data, p = 3, ...){
  set.seed(20130201)
  pos <- lapply(3^(0:p), function(x) sample(1:length(data), x))
  datasetsNA <- lapply(pos, function(x) {data[x] <- NA; data})
  original <- data.frame(stl(data, ...)$time.series, stringsAsFactors = FALSE)
  original$id <- "Original"
  datasetsNA <- lapply(datasetsNA, function(x) 
    data.frame(stl(x, na.action = na.approx, ...)$time.series, 
               id = paste(sum(is.na(x)), "NAs"), 
               stringsAsFactors = FALSE))
  stlAll <- rbind.fill(c(list(original), datasetsNA))
  stlAll$Date <- time(data)
  stlAll <- melt(stlAll, id.var = c("id", "Date"))
  results <- data.frame(trend = sapply(lapply(datasetsNA, '[', i = "trend"), mape, original[, "trend"]))
  results$id <- paste(3^(0:p), "NAs")
  results <- melt(results, id.var = "id")
  results$x <- min(stlAll$Date) + diff(range(stlAll$Date)) / 4
  results$y <- min(original[, "trend"]) + diff(range(original[, "trend"])) / (4 * p) * (0:p)
  results$value <- round(results$value, 2)
  ggplot(stlAll, aes(x = Date, y = value, colour = id, group = id)) + geom_line() + 
    facet_wrap(~ variable, scales = "free_y") + theme_bw() +
    theme(legend.title = element_blank(), strip.background = element_rect(fill = "white")) + 
    labs(x = NULL, y = NULL) + scale_colour_brewer(palette = "Set1") +
    lapply(unique(results$id), function(z)
      geom_text(data = results, colour = "black", size = 3,
                aes(x = x, y = y, label = paste0("MAPE (", id, "): ", value, "%"))))
}

nottem，240个观测值

stlCheck(nottem, s.window = 4, t.window = 50, t.jump = 1)

co2，468个观察结果

stlCheck(log(co2), s.window = 21)

mdeaths，有72个观察结果

stlCheck(mdeaths, s.window = "per")

从视觉上看，案例1和案例3确实存在一些趋势差异，但考虑到样本量，这些差异在1中很小，在3中也令人满意(72).

Visually we do see some differences in trend in cases 1 and 3. But these differences are pretty small in 1 and also satisfactory in 3 considering sample size (72).

这篇关于具有缺失值的时间序列的STL分解以进行异常检测的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！