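This article covers how to subset the columns of a data frame using a logical vector. It may serve as a useful reference for anyone facing the same problem.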

Problem description

I have a data frame, and I want to drop the columns whose NA rate is above 70% or in which a single dominant value takes up more than 99% of the rows. How can I do that in R?

I find it easy to select rows with a logical vector in the subset function, but how can I do the same for columns? For example, if I write:

isNARateLt70 <- function(column) {
  # some code
}
apply(dataframe, 2, isNARateLt70)

Then how can I use the resulting vector to subset the data frame?

Solution

There's really no need to write a function when we have colMeans (thanks to @MrFlick for the advice to change from colSums()/nrow(); the colMeans version is shown at the bottom of this answer).
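
As a rough illustration of why colMeans is enough here, the sketch below computes the per-column NA fraction both ways. The variable name na_rate is only illustrative, and the toy data mirrors the example used in this answer.

```r
# Toy data, same shape as the example in this answer.
d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
                z = c(rep(NA, 3), 1, 2))

# is.na() on a data frame returns a logical matrix, so the column means
# are the per-column fractions of missing values.
na_rate <- colMeans(is.na(d))
na_rate
#   x   y   z
# 1.0 0.4 0.6

# The longer colSums()/nrow() form gives the same numbers.
all.equal(na_rate, colSums(is.na(d)) / nrow(d))
```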

Here's how I would approach your function if you want to use sapply on it later.

> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
                  z = c(rep(NA, 3), 1, 2))

> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
#     x     y     z
# FALSE  TRUE  TRUE

Then, to subset your data using the result of the above line of code, it's

> d[sapply(d, isNARateLt70)]
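
If the logical vector needs to be reused, one possible way is to store it first and then index with it. The sketch below assumes the same toy data. The names keep, by_bracket, by_matrix and by_filter are illustrative only. Filter() is base R and is shown here as an alternative, not as part of the original answer.

```r
# Toy data and predicate from the answer above.
d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
                z = c(rep(NA, 3), 1, 2))

isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7

# Store the per-column result as a named logical vector.
keep <- sapply(d, isNARateLt70)

# Three ways to keep only the columns flagged TRUE.
by_bracket <- d[keep]                  # list-style indexing
by_matrix  <- d[, keep, drop = FALSE]  # matrix-style; drop = FALSE keeps a data frame
by_filter  <- Filter(isNARateLt70, d)  # applies the predicate and subsets in one step

names(by_bracket)
# [1] "y" "z"
```

The apply(dataframe, 2, ...) call from the question should produce the same flags for this data, but it converts the data frame to a matrix first, so sapply is usually the safer choice when columns have mixed types.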

But as mentioned, colMeans works just the same,

> d[colMeans(is.na(d)) <= 0.7]
#    y  z
# 1  1 NA
# 2 NA NA
# 3 NA NA
# 4  1  1
# 5  1  2
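
The answer above only covers the NA-rate condition. For the second condition in the question (a dominant value taking over 99% of rows), a minimal sketch might look like the following. The helper dominantShare and the data frame d2 are hypothetical names introduced here, and the sketch assumes the share is measured against all rows, including those that are NA.

```r
# Hypothetical helper: share of rows taken by the most frequent non-NA value.
dominantShare <- function(x) {
  counts <- table(x)                  # table() ignores NA by default
  if (length(counts) == 0) return(0)  # column is entirely NA
  max(counts) / length(x)             # assumption: denominator counts NA rows too
}

# Illustrative data only: `const` holds the same value in every row.
d2 <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
                 const = rep("a", 5))

# Combine both conditions into one logical vector, then subset columns.
keep <- colMeans(is.na(d2)) <= 0.7 & sapply(d2, dominantShare) <= 0.99
d2[keep]
```

Here x is dropped for its NA rate and const for its dominant value, so only y remains.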


09-05 16:25