Problem description
I've just started with R and I've executed these statements:
library(datasets)
head(airquality)
s <- split(airquality,airquality$Month)
sapply(s, function(x) {colMeans(x[,c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)})
lapply(s, function(x) {colMeans(na.omit(x[,c("Ozone", "Solar.R", "Wind")])) })
For sapply, it returns the following:
                5         6          7          8         9
Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind     11.62258  10.26667   8.941935   8.793548  10.18000
And for lapply, it returns the following:
$`5`
Ozone Solar.R Wind
24.12500 182.04167 11.50417
$`6`
Ozone Solar.R Wind
29.44444 184.22222 12.17778
$`7`
Ozone Solar.R Wind
59.115385 216.423077 8.523077
$`8`
Ozone Solar.R Wind
60.00000 173.08696 8.86087
$`9`
Ozone Solar.R Wind
31.44828 168.20690 10.07586
Now, my question is: why are the returned values similar, but not the same? Aren't na.rm = TRUE and na.omit supposed to do exactly the same thing, that is, omit the missing values and calculate the mean only over the values that we have? And in that case, shouldn't I have gotten the same values in both result sets?
Thanks a lot for your input!
Recommended answer
They are not supposed to give the same result. Consider this example:
exdf<-data.frame(a=c(1,NA,5),b=c(3,2,2))
# a b
#1 1 3
#2 NA 2
#3 5 2
colMeans(exdf,na.rm=TRUE)
# a b
#3.000000 2.333333
colMeans(na.omit(exdf))
# a b
#3.0 2.5
Why is this? In the first case, the mean of column b is calculated as (3+2+2)/3, because na.rm = TRUE only drops the NA values within each column. In the second case, na.omit removes the second row in its entirety, including the value of b in that row, which is not NA and is therefore counted in the first case. The mean of b is then just (3+2)/2.
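To make the difference concrete, here is a minimal sketch that counts how many values each approach actually averages over. The variable names (n_per_column, n_complete, may, and so on) are only illustrative and not part of the original question or answer. The last part applies the same check to the month 5 subset of airquality, assuming the three columns used in the question.

```r
library(datasets)

# Toy data frame from the answer
exdf <- data.frame(a = c(1, NA, 5), b = c(3, 2, 2))

# na.rm = TRUE drops NAs column by column, so each column keeps its own count
n_per_column <- colSums(!is.na(exdf))

# na.omit() drops every row that contains at least one NA,
# so all columns end up with the same (smaller or equal) count
n_complete <- nrow(na.omit(exdf))

# Means over each column's own non-missing values
mean_na_rm <- colMeans(exdf, na.rm = TRUE)

# Means over complete rows only
mean_na_omit <- colMeans(na.omit(exdf))

print(n_per_column)  # a = 2, b = 3
print(n_complete)    # 2

# Same check on the question's data, restricted to month 5 (illustrative subset)
may <- airquality[airquality$Month == 5, c("Ozone", "Solar.R", "Wind")]
may_per_column <- colSums(!is.na(may))    # non-missing values per column
may_complete <- sum(complete.cases(may))  # rows with no missing value at all

print(may_per_column)
print(may_complete)
```

Whenever a column has more non-missing values than there are complete rows, the two means for that column are computed over different sets of values and can differ. A column whose NAs account for every incomplete row, such as a in the toy example, gives the same mean either way.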