Problem description
My current data frame consists of numerical values. I am identifying outliers column by column. Can I identify the outliers in all columns at once and remove them in one go? Right now I am changing the values to NA.
My code:
quantiles<-tapply(var1,names,quantile)
minq <- sapply(names, function(x) quantiles[[x]]["25%"])
maxq <- sapply(names, function(x) quantiles[[x]]["75%"])
var1[var1<minq | var1>maxq] <- NA
Data.
Data posted by the OP in a comment, in dput format.
df1 <-
  structure(list(Var1 = c(100.2, 110, 200, 456, 120000),
                 var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
                 var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
                 var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)),
            class = "data.frame", row.names = c(NA, -5L))
Recommended answer
The following function tests which values in each column fall outside a pair of quantile bounds, by default below the 1st quartile or above the 3rd quartile. Note that this is stricter than Tukey's fences, which extend 1.5 times the interquartile range beyond those quartiles. Then, depending on the user's preference, the function either removes every row that contains at least one outlier or replaces the outliers with NA.
outlier.out <- function(dat, q = c(0.25, 0.75), out = TRUE) {
  # logical matrix marking which cells are outliers
  tests <- matrix(NA, ncol = ncol(dat), nrow = nrow(dat))
  # test which cells contain outliers, ignoring existing NA values
  for (i in seq_len(ncol(dat))) {
    qq <- quantile(dat[, i], q, na.rm = TRUE)
    # isTRUE() turns NA comparisons into FALSE
    tests[, i] <- sapply(dat[, i] < qq[1] | dat[, i] > qq[2], isTRUE)
  }
  if (out) {
    # remove rows that contain at least one outlier
    dat <- dat[!apply(tests, 1, FUN = any, na.rm = TRUE), ]
  } else {
    # replace outliers with NA
    dat[tests] <- NA
  }
  return(dat)
}
outlier.out(df1)
#   Var1  var2   var3 var4
# 4  456 44422 215000 0.78

outlier.out(df1, out = FALSE)
#   Var1  var2   var3  var4
# 1   NA    NA     NA 0.983
# 2  110    NA 210000    NA
# 3  200 45465     NA 0.983
# 4  456 44422 215000 0.780
# 5   NA    NA     NA    NA
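For comparison, here is a minimal sketch of the conventional Tukey's fences rule, which flags only values lying more than k times the interquartile range below the 1st quartile or above the 3rd quartile (k = 1.5 is the usual choice). The helper name tukey_flags and its k parameter are assumptions made for illustration and are not part of the answer above. The sketch also assumes a data frame with numeric columns only and at least two rows.

```r
# Hypothetical helper (not from the original answer): returns a logical
# matrix marking the cells that lie outside Tukey's fences, per column.
tukey_flags <- function(dat, k = 1.5) {
  vapply(dat, function(x) {
    qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- qq[2] - qq[1]
    out <- x < qq[1] - k * iqr | x > qq[2] + k * iqr
    # existing NA values are not treated as outliers
    !is.na(out) & out
  }, logical(nrow(dat)))
}

# same data as in the question
df1 <- data.frame(
  Var1 = c(100.2, 110, 200, 456, 120000),
  var2 = c(NA, 4545L, 45465L, 44422L, 250000L),
  var3 = c(NA, 210000L, 91500L, 215000L, 250000L),
  var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)
)

flags <- tukey_flags(df1)

# option 1: replace the flagged cells with NA
df_na <- df1
df_na[flags] <- NA

# option 2: drop every row that contains at least one flagged cell
df_clean <- df1[rowSums(flags) == 0, ]
```

Because the fences are wider than the plain quartile bounds, fewer values are flagged: on this data only rows 2, 3 and 5 contain a flagged cell, so df_clean keeps rows 1 and 4. Computing the logical matrix once and reusing it for both options keeps the detection step separate from the decision about what to do with the outliers.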