Question
I would like to add random NA values to a data.frame in R. So far I've looked into this question:
How do I add random NAs into a data frame
Many solutions were provided there, but I couldn't find one that complies with these 5 conditions:
- Add truly random NAs, and not the same amount per row or per column.
- Work with every class of variable that one can encounter in a data.frame (numeric, character, factor, logical, ts, ...), so the output must have the same format as the input data.frame or matrix.
- Guarantee an exact number or proportion [note] of NAs in the output (many solutions produce fewer NAs because several are generated at the same position).
- Be computationally efficient for big datasets.
- Add the proportion/number of NAs independently of the NAs already present in the input.
Does anyone have an idea? I have already tried to write a function to do this (in an answer to the first link), but it doesn't comply with points 3 and 4. Thanks.
[note] An exact proportion, rounded to +/- 1 NA of course.
Recommended answer
This is the way that I do it for my paper on library(imputeMulti), which is currently in review at JSS. It inserts NAs into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact number when (n * p * pctNA) %% 1 != 0, because the count is rounded down.
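createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  # one logical flag per cell, in column-major order
  NAloc <- rep(FALSE, n * p)
  # pick floor(n * p * pctNA) distinct cells, so no position is drawn twice
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  # reshape the flags to the shape of x and blank out the flagged cells
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}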
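To make the rounding caveat concrete, here is a small arithmetic sketch. The dimensions (7 rows, 3 columns) are arbitrary illustrative values, not taken from the original post:

```r
n <- 7; p <- 3; pctNA <- 0.1
target <- n * p * pctNA        # about 2.1 cells requested, not a whole number
(target %% 1) != 0             # TRUE: an exact 10% is impossible for 21 cells
k <- floor(target)             # number of cells the function above would set to NA
k / (n * p)                    # realized proportion, slightly below pctNA
```

The gap between the requested and realized proportion is always less than one cell, which matches the +/- 1 NA tolerance asked for in the question.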
Obviously you should set a random seed for reproducibility; this can be done before the function call.
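As a minimal sketch of the seeding advice, the snippet below repeats the function so that it runs on its own and uses the built-in iris data purely as an example input (the choice of dataset and of seed value 42 are assumptions for illustration):

```r
createNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  x
}

set.seed(42)
a <- createNAs(iris, pctNA = 0.1)
set.seed(42)
b <- createNAs(iris, pctNA = 0.1)

identical(a, b)       # same seed, same NA positions
sum(is.na(a))         # floor(150 * 5 * 0.1) cells, since iris has no NA to begin with
sapply(a, class)      # column classes (numeric, factor) are preserved
```

Setting the seed immediately before each call is what makes the two results comparable; calling createNAs twice after a single set.seed would draw different positions.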
This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.
Edit: I do assume that x is complete, so I'm not sure how it would handle existing missing data. You could certainly modify the code if you want, though that would probably increase the runtime by at least O(n*p).
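One possible modification, sketched here under assumptions rather than taken from the answer: when x already contains missing values, the original sampling can land on cells that are already NA, so fewer new NAs are added than requested. The hypothetical variant below (the name createNAsExisting is illustrative) scans for observed cells first, which is the extra O(n*p) pass mentioned above, and samples only among them. It interprets "independently of already present NA" as "add floor(n * p * pctNA) new NAs on top of the existing ones"; other readings of that requirement are possible.

```r
# Hypothetical variant; createNAsExisting is an illustrative name.
createNAsExisting <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  # O(n * p) scan: column-major indices of the cells that are not NA yet
  observed <- which(!is.na(x))
  # never ask for more new NAs than there are observed cells
  k <- min(floor(n * p * pctNA), length(observed))
  if (k == 0) return(x)
  NAloc <- rep(FALSE, n * p)
  # index into `observed` instead of calling sample(observed, k), which would
  # treat a length-one vector as the range 1:observed
  NAloc[observed[sample.int(length(observed), k)]] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  x
}

set.seed(1)
df <- data.frame(num = c(1, NA, 3, 4, 5, 6),
                 chr = c("a", "b", NA, "d", "e", "f"),
                 fac = factor(c("u", "v", "w", "u", "v", "w")),
                 lgl = c(TRUE, FALSE, TRUE, NA, TRUE, FALSE),
                 stringsAsFactors = FALSE)
out <- createNAsExisting(df, pctNA = 0.25)

sum(is.na(out)) - sum(is.na(df))   # newly added NAs: floor(6 * 4 * 0.25)
sapply(out, class)                 # column classes are unchanged
```

Because the new positions are drawn without replacement from cells that were observed, the number of added NAs equals the requested count whenever enough observed cells remain.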