Identifying outliers in a data frame in R

Problem description

The current data frame consists of numerical values. I am identifying outliers in my data frame column by column. Can I identify the outliers in all columns at once and remove them in one go? Right now I am changing the values to NA.

My code:

    quantiles<-tapply(var1,names,quantile)
    minq <- sapply(names, function(x) quantiles[[x]]["25%"])
    maxq <- sapply(names, function(x) quantiles[[x]]["75%"])
    var1[var1<minq | var1>maxq] <- NA

Data.

Data posted by the OP in a comment, in dput format.

    df1 <-
    structure(list(Var1 = c(100.2, 110, 200, 456, 120000),
    var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
    var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
    var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)),
    class = "data.frame", row.names = c(NA, -5L))


Recommended answer

The following function tests which values in each column lie outside the quantiles given in q, by default below the 1st quartile or above the 3rd. (Tukey's fences proper sit 1.5 × IQR beyond the quartiles, so the rule used here is stricter: it flags every value outside the interquartile range.) Then, depending on the user's preference, the function either removes every row that contains an outlier or replaces the outliers with NA.

    outlier.out <- function(dat, q = c(0.25, 0.75), out = TRUE){
        # create a place for identification of outliers
        tests <- matrix(NA, ncol = ncol(dat), nrow = nrow(dat))
        # test which cells contain outliers, ignoring existing NA values
        for(i in seq_len(ncol(dat))){
            qq <- quantile(dat[, i], q, na.rm = TRUE)
            tests[, i] <- sapply(dat[, i] < qq[1] | dat[, i] > qq[2], isTRUE)
        }
        if(out){
            # remove rows that contain outliers
            dat <- dat[!apply(tests, 1, FUN = any, na.rm = TRUE) ,]
        } else {
            # replace outliers with NA
            dat[tests] <- NA
        }
        return(dat)
    }

    outlier.out(df1)
    #   Var1  var2   var3 var4
    # 4  456 44422 215000 0.78
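
To see why only row 4 survives, it can help to look at the cut-offs each column is compared against. The following is a minimal sketch and not part of the original answer; the names bounds and flags are introduced here purely for illustration.

```r
# Data from the question (dput output)
df1 <- structure(list(Var1 = c(100.2, 110, 200, 456, 120000),
                      var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
                      var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
                      var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)),
                 class = "data.frame", row.names = c(NA, -5L))

# Lower (25%) and upper (75%) cut-off for every column, ignoring NA values
bounds <- sapply(df1, quantile, probs = c(0.25, 0.75), na.rm = TRUE)
bounds

# Logical matrix: TRUE where a non-missing value lies outside its column's bounds
flags <- mapply(function(x, lo, hi) !is.na(x) & (x < lo | x > hi),
                df1, bounds[1, ], bounds[2, ])

# Rows in which no value was flagged
which(rowSums(flags) == 0)
```

Because the cut-offs are the quartiles themselves, roughly half of the non-missing values in each column are flagged, which is why so few rows remain on a data set this small.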


    outlier.out(df1, out = FALSE)
    #   Var1  var2   var3  var4
    # 1   NA    NA     NA 0.983
    # 2  110    NA 210000    NA
    # 3  200 45465     NA 0.983
    # 4  456 44422 215000 0.780
    # 5   NA    NA     NA    NA
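
If the intent is to use Tukey's fences in the usual sense, the cut-offs are Q1 - k * IQR and Q3 + k * IQR, with k = 1.5 by convention. A sketch of such a variant might look as follows. It is an assumption about what may be wanted, not the original answer's method, and tukey_outlier_out and its argument k are hypothetical names introduced here.

```r
# Data from the question (dput output)
df1 <- structure(list(Var1 = c(100.2, 110, 200, 456, 120000),
                      var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
                      var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
                      var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)),
                 class = "data.frame", row.names = c(NA, -5L))

# Hypothetical helper: flag values below Q1 - k * IQR or above Q3 + k * IQR
tukey_outlier_out <- function(dat, k = 1.5, out = TRUE) {
    # one logical column per variable, bound together into a matrix
    tests <- do.call(cbind, lapply(dat, function(x) {
        qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
        fence <- k * (qq[2] - qq[1])
        # existing NA values are never counted as outliers
        !is.na(x) & (x < qq[1] - fence | x > qq[2] + fence)
    }))
    if (out) {
        # keep only rows without any flagged value
        dat[rowSums(tests) == 0, , drop = FALSE]
    } else {
        # replace flagged values with NA
        dat[tests] <- NA
        dat
    }
}

tukey_outlier_out(df1)
tukey_outlier_out(df1, out = FALSE)
```

With the wider fences, fewer values are flagged than with the quartile rule above: on the question's data, rows 1 and 4 are kept instead of row 4 alone.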

08-11 16:30