Problem description
I am trying to remove the NAs from my data frame by interpolating with na.approx(), but I can't remove all of them.
My data frame is 4096x4096, with 270.15 as the flag for invalid values. I need the data to be continuous at all points to feed a meteorological model. Yesterday I asked, and obtained an answer, on how to replace values in a data frame based on another data frame. After that I came across na.approx() and decided to replace the 270.15 values with NA and let na.approx() interpolate the data. The question is why na.approx() does not replace all of the NAs.
This is what I am doing:
- Read the original hdf file with hdf5load
- Subset the data frame (4094x4096)
- Substitute the flag value with NA
> sst4[sst4 == 270.15 ] = NA
Check the first column (or any other):
> summary(sst4[,1])
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
271.3 276.4 285.9 285.5 292.3 302.8 1345.0
Run na.approx:
> sst4=na.approx(sst4,na.rm="FALSE")
Check the first column again:
> summary(sst4[,1])
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
271.3 276.5 286.3 285.9 292.6 302.8 411.0
As you can see, 411 NAs have not been removed. Why? Do they all correspond to leading/trailing column values?
head(sst4[,1])
[1] NA NA NA NA NA NA
tail(sst4[,1])
[1] NA NA NA NA NA NA
Does na.approx need valid values before and after an NA in order to interpolate? Do I need to set any other na.approx option?
Many thanks.
Recommended answer
A small reproducible example:
library(zoo)
set.seed(1)
m <- matrix(runif(16, 0, 100), nrow = 4)
missing_values <- sample(16, 7)
m[missing_values] <- NA
m
[,1] [,2] [,3] [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 NA 6.178627 38.41037
[3,] NA NA NA NA
[4,] 90.82078 66.07978 NA NA
na.approx(m)
[,1] [,2] [,3] [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206 6.178627 38.41037
[3,] 64.01658 50.77592 NA NA
[4,] 90.82078 66.07978 NA NA
m[4, 4] <- 50
na.approx(m)
[,1] [,2] [,3] [,4]
[1,] 26.55087 20.16819 62.911404 68.70228
[2,] 37.21239 35.47206 6.178627 38.41037
[3,] 64.01658 50.77592 NA 44.20519
[4,] 90.82078 66.07978 NA 50.00000
Yes, it looks like the start and end values of each column need to be known, or the interpolation does not work. Can you guess values for your boundaries?
So by default, the start and end values of each column need to be known. However, you can make na.approx always fill in the blanks by passing rule = 2; see Felix's answer. You can also use na.fill to provide a default value, as per Gabor's comment. Finally, you can interpolate in two directions (see below) or guess the boundary conditions.
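To make the difference between these options concrete, here is a minimal sketch on a toy vector. The vector x and the fill value 0 are made up for illustration and are not taken from the question's data.

```r
library(zoo)

# Hypothetical toy series with a leading, an interior, and a trailing NA
x <- c(NA, 1, NA, 3, NA)

# Default rule: only the interior NA is interpolated; with na.rm = FALSE
# the leading and trailing NAs are kept in place
edges_kept <- na.approx(x, na.rm = FALSE)

# rule = 2 is passed on to approx(): values outside the known range
# take the nearest known value, so no NA is left
edges_extended <- na.approx(x, rule = 2)

# na.fill() with a single value replaces every NA with that value
filled_default <- coredata(na.fill(zoo(x), 0))

print(edges_kept)
print(edges_extended)
print(filled_default)
```

Note that rule = 2 repeats the nearest valid value at the edges rather than extrapolating a trend, which may or may not be acceptable for a temperature field.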
A further thought: since na.approx only interpolates within columns, and your data is spatial, interpolating within rows as well might be useful. You could then take the average of the two.
na.approx fails when whole columns are NA, so we create a bigger dataset.
set.seed(1)
m <- matrix(runif(64, 0, 100), nrow = 8)
missing_values <- sample(64, 15)
m[missing_values] <- NA
Run na.approx in both directions.
by_col <- na.approx(m)
by_row <- t(na.approx(t(m)))
Work out the best guess.
default <- 50
best_guess <- ifelse(is.na(by_row),
  ifelse(
    is.na(by_col),
    default,               # neither known
    by_col                 # only by_col known
  ),
  ifelse(
    is.na(by_col),
    by_row,                # only by_row known
    (by_row + by_col) / 2  # both known
  )
)
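The same four-way choice can also be written as an NA-ignoring mean over the two interpolations. The sketch below is an alternative formulation, not the answer's original code; the 4x4 matrix is invented so that each case (only by_row known, only by_col known, both known, neither known) occurs exactly once, and na.rm = FALSE is set explicitly so the results keep the input's dimensions.

```r
library(zoo)

# Hypothetical 4x4 grid; NA marks missing cells
m <- rbind(c( 1, NA,  3,  4),
           c(NA,  6, NA, 10),
           c( 9, 10, 11, 12),
           c(13, 14, 15, NA))

# Interpolate down the columns, then along the rows (via transposition)
by_col <- na.approx(m, na.rm = FALSE)
by_row <- t(na.approx(t(m), na.rm = FALSE))

# Stack both results and average over the third dimension, ignoring NAs;
# cells that are NA in both directions come out as NaN
stacked <- array(c(by_col, by_row), dim = c(dim(m), 2))
best_guess <- rowMeans(stacked, na.rm = TRUE, dims = 2)

# Fall back to an assumed default where neither direction gave a value
default <- 50
best_guess[is.nan(best_guess)] <- default

print(best_guess)
```

Stacking into an array makes it easy to add further estimates (for example a diagonal interpolation) without nesting more ifelse calls.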