Question
I have a data frame from which I want to drop columns whose NA rate is above 70%, or in which a single dominant value takes up more than 99% of the rows. How can I do that in R?
I find it easy to select rows with a logical vector in the subset function, but how can I do something similar for columns? For example, if I write:
isNARateLt70 <- function(column) {
  # some code
}
apply(dataframe, 2, isNARateLt70)
Then how can I use the resulting vector to subset the data frame?
Solution
There's really no need to write a function when we have colMeans (thanks @MrFlick for the advice to change from colSums()/nrow(); this is shown at the bottom of this answer).
Here's how I would write your function if you want to use sapply with it later.
> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
z = c(rep(NA, 3), 1, 2))
> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
# x y z
# FALSE TRUE TRUE
Then, to subset your data using the line of code above, it's
> d[sapply(d, isNARateLt70)]
But as mentioned, colMeans works just the same:
> d[colMeans(is.na(d)) <= 0.7]
# y z
# 1 1 NA
# 2 NA NA
# 3 NA NA
# 4 1 1
# 5 1 2
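The answer above covers only the NA-rate criterion. The question also asks about dropping columns where one value dominates more than 99% of the rows, and a minimal sketch of combining both tests might look like the following. The helper name dominantRate and the extra column w are hypothetical, added only for illustration. Dividing by the total row count (so that NA rows count against the dominant value) is an assumption about what "99% of rows" means.

```r
# Same toy data as above, plus a constant column w added for illustration.
d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
                z = c(rep(NA, 3), 1, 2))
d$w <- rep(7, 5)

# Hypothetical helper: share of all rows taken by the most frequent non-NA value.
dominantRate <- function(x) {
  counts <- table(x)                  # table() ignores NA by default
  if (length(counts) == 0) return(0)  # all-NA column: no dominant value
  max(counts) / length(x)
}

# Keep a column only if it passes both tests.
keep <- colMeans(is.na(d)) <= 0.7 & sapply(d, dominantRate) <= 0.99
d[keep]
```

With this data, x fails the NA test and w fails the dominant-value test, so only y and z are kept. If the dominant share should instead be measured among non-NA rows only, replacing length(x) with sum(counts) would be one way to do it.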