R data.table: by group, keep the first row when a column is entirely missing

Problem description

I have a data.table and I am trying to do something akin to data[ !is.na(variable) ]. However, for groups that are entirely missing, I'd like to keep just the first row of that group. So I am trying to subset using by. I have done some research online and have a solution, but I think it is inefficient.

I've provided an example below showing what I am hoping to achieve, and I wonder if this can be done without creating the two extra columns.

    library(data.table)

    d_sample = data.table( ID = c(1, 1, 2, 2, 3, 3),
                           Time = c(10, 15, 100, 110, 200, 220),
                           Event = c(NA, NA, NA, 1, 1, NA) )

    # flag every row with a non-missing Event
    d_sample[ !is.na(Event), isValidOutcomeRow := TRUE, by = ID ]
    # flag IDs that have at least one non-missing Event (stays NA otherwise)
    d_sample[ , isValidOutcomePatient := any(isValidOutcomeRow), by = ID ]
    # for IDs that are entirely missing, flag only the first row
    d_sample[ is.na(isValidOutcomePatient), isValidOutcomeRow := c(TRUE, rep(NA, .N - 1)), by = ID ]
    d_sample[ isValidOutcomeRow == TRUE ]

EDIT: Here are some speed comparisons of thelatemail's and Frank's solutions on a larger dataset with 60K rows.

    d_sample = data.table( ID = sort(rep(seq(1, 30000), 2)),
                           Time = rep(c(10, 15, 100, 110, 200, 220), 10000),
                           Event = rep(c(NA, NA, NA, 1, 1, NA), 10000) )

thelatemail's solution gets a runtime of 20.65 on my computer (system.time reports seconds).

    system.time(d_sample[, if(all(is.na(Event))) .SD[1] else .SD[!is.na(Event)][1], by=ID])

Frank's first solution gets a runtime of 0.

    system.time( unique( d_sample[order(is.na(Event))], by="ID" ) )

Frank's second solution gets a runtime of 0.05.

    system.time( d_sample[order(is.na(Event)), .SD[1L], by=ID] )

Recommended answer

This seems to work:

    unique( d_sample[order(is.na(Event))], by="ID" )

       ID Time Event
    1:  2  110     1
    2:  3  200     1
    3:  1   10    NA

Alternately, d_sample[order(is.na(Event)), .SD[1L], by=ID].

Extending the OP's example, I also find similar timings for the two approaches:

    n = 12e4 # must be a multiple of 6
    set.seed(1)
    d_sample = data.table( ID = sort(rep(seq(1, n/2), 2)),
                           Time = rep(c(10, 15, 100, 110, 200, 220), n/6),
                           Event = rep(c(NA, NA, NA, 1, 1, NA), n/6) )

    system.time(rf <- unique( d_sample[order(is.na(Event))], by="ID" ))
    # 1.17

    system.time(rf2 <- d_sample[order(is.na(Event)), .SD[1L], by=ID] )
    # 1.24

    system.time(rt <- d_sample[, if(all(is.na(Event))) .SD[1] else .SD[!is.na(Event)], by=ID])
    # 10.42

    system.time(rt2 <- d_sample[ d_sample[, {
        w = which(is.na(Event))
        .I[ if (length(w) == .N) 1L else -w ]
    }, by=ID]$V1 ])
    # .13

    # verify
    identical(rf, rf2) # TRUE
    identical(rf, rt)  # FALSE
    fsetequal(rf, rt)  # TRUE
    identical(rt, rt2) # TRUE

The variation on @thelatemail's solution, rt2, is the fastest by a wide margin.
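One point the timings do not show is that the approaches only agree because every ID in the sample data has at most one non-missing Event and at least one NA. The sketch below uses hypothetical data (not from the question) in which one ID has two observed events and another has no NA at all. It also includes a logical-index variant of the .I approach, which is an assumption of this sketch rather than part of the original answer.

```r
library(data.table)

# Hypothetical data: ID 1 is entirely missing, ID 2 has two observed
# events and one NA, ID 3 has no missing values at all.
d <- data.table(
  ID    = c(1, 1, 2, 2, 2, 3, 3),
  Time  = c(10, 15, 100, 110, 120, 200, 220),
  Event = c(NA, NA, 1, NA, 1, 1, 1)
)

# unique() after ordering keeps exactly one row per ID: the first
# non-missing row, or the first row when the whole group is missing.
one_per_id <- unique(d[order(is.na(Event))], by = "ID")

# Logical-index variant: keeps every non-missing row, plus the first
# row of groups that are entirely missing.
keep <- d[, {
  na <- is.na(Event)
  .I[if (all(na)) 1L else !na]
}, by = ID]$V1
all_valid <- d[keep]

# which()-based form from the answer: when a group has no NA, w is empty
# and .I[-w] selects nothing, so that group (ID 3 here) is dropped.
keep_w <- d[, {
  w <- which(is.na(Event))
  .I[if (length(w) == .N) 1L else -w]
}, by = ID]$V1

nrow(one_per_id)  # 3 rows, one per ID
keep              # rows 1, 3, 5, 6, 7
keep_w            # rows 1, 3, 5
```

If the goal is the original data[ !is.na(variable) ] behaviour with a fallback row for all-missing groups, the row-index form is the closer match. The unique() form is appropriate only when one row per group is wanted.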