Adding an exact proportion of random missing values to a data.frame


Question

I would like to add random NAs to a data.frame in R. So far I have looked into these questions:

R: Randomly insert NAs into a data frame proportionally

How do I add random NAs into a data frame

Add random missing values to a complete data frame (in R)

Many solutions were provided there, but I couldn't find one that complies with all 5 of these conditions:

1. Adds truly random NAs, not the same number in every row or every column.
2. Works with every class of variable that one can encounter in a data.frame (numeric, character, factor, logical, ts, ...), so the output must have the same format as the input data.frame or matrix.
3. Guarantees an exact number or proportion [note] of NAs in the output (many solutions produce fewer NAs than requested, because several are generated at the same position).
4. Is computationally efficient for big datasets.
5. Adds the requested proportion/number of NAs independently of any NAs already present in the input.

Does anyone have an idea? I have already tried to write a function to do this (in an answer to the first link), but it doesn't comply with conditions 3 and 4. Thanks.

[note] An exact proportion, rounded to within +/- 1 NA of course.
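The shortfall mentioned in condition 3 can be illustrated with a small sketch. The cell count and target below are hypothetical values chosen for illustration. The point is only that drawing positions with replacement can pick the same cell twice, while drawing without replacement cannot.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n_cells <- 100                   # hypothetical number of cells in the table
k <- 30                          # hypothetical number of NAs we want

# Drawing positions with replacement can hit the same cell more than once,
# so the number of distinct cells may be smaller than k.
with_repl <- sample.int(n_cells, k, replace = TRUE)
length(unique(with_repl))        # at most k

# Drawing without replacement (the default) always yields k distinct cells.
without_repl <- sample.int(n_cells, k)
length(unique(without_repl))     # exactly k
```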

Recommended answer

This is the way I do it for my paper on library(imputeMulti), which is currently in review at JSS. It inserts NAs into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact proportion when n * p * pctNA is not a whole number, i.e. (n * p * pctNA) %% 1 != 0, because the count is rounded down with floor().

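createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  # One logical flag per cell, in column-major order
  NAloc <- rep(FALSE, n * p)
  # Pick floor(n * p * pctNA) distinct cells (sampling without replacement)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  # Reshape the flags to the dimensions of x and blank out the chosen cells
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}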

Obviously you should set a random seed for reproducibility; this can be done with set.seed() before the function call.
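As a minimal usage sketch, the function can be called on a small mixed-class data.frame with the seed set beforehand. The data frame `df` and the seed value are made up for illustration, and createNAs is repeated only so the snippet runs on its own.

```r
# Same function as in the answer, repeated so this snippet is self-contained.
createNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  x
}

# Hypothetical toy data with mixed column classes.
df <- data.frame(
  num = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5),
  chr = letters[1:10],
  fac = factor(rep(c("a", "b"), 5)),
  lgl = rep(c(TRUE, FALSE), 5),
  stringsAsFactors = FALSE
)

set.seed(42)                      # arbitrary seed
out1 <- createNAs(df, pctNA = 0.25)
set.seed(42)                      # reset to the same seed
out2 <- createNAs(df, pctNA = 0.25)

identical(out1, out2)             # same seed, same NA positions
sum(is.na(out1))                  # floor(10 * 4 * 0.25) = 10
sapply(out1, class)               # column classes are unchanged
```

Because the NAs are assigned through a logical matrix, each column keeps its own class, which is what condition 2 of the question asks for in the data.frame case.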

This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) doesn't state it clearly.

Edit: I do assume that x is complete, so I'm not sure how it would handle existing missing data. You could certainly modify the code to account for that, though doing so would probably increase the runtime by at least O(n*p).
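One possible modification along those lines, as a sketch only, is to draw the new NA positions exclusively from cells that are not already missing. The function name `addNewNAs` and the toy data are hypothetical and not part of the answer or of imputeMulti. The extra is.na() scan over all cells is where the additional O(n*p) cost comes from.

```r
# Hypothetical variant (not the answer's code): add floor(n * p * pctNA) NEW
# NAs, drawing only from cells that are not already NA.
addNewNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  observed <- which(!is.na(x))           # linear indices of non-NA cells
  k <- floor(n * p * pctNA)              # number of NAs to add
  if (k > length(observed)) {
    stop("not enough non-NA cells to add the requested number of NAs")
  }
  # Index into `observed` rather than calling sample(observed, k), because
  # sample() treats a length-one numeric vector v as 1:v.
  picked <- observed[sample.int(length(observed), k)]
  NAloc <- rep(FALSE, n * p)
  NAloc[picked] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  x
}

# Hypothetical toy data that already contains two NAs.
df <- data.frame(a = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10),
                 b = letters[1:10],
                 stringsAsFactors = FALSE)
df$b[3] <- NA

set.seed(1)                              # arbitrary seed
out <- addNewNAs(df, pctNA = 0.25)       # floor(10 * 2 * 0.25) = 5 new NAs

sum(is.na(df))                           # 2 NAs before
sum(is.na(out))                          # 2 existing + 5 new = 7
```

Here the proportion is taken relative to all n * p cells, so the number of added NAs does not depend on how many were already present. Taking it relative to the observed cells instead would be an equally defensible design choice.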
