问题描述
这是更新对的扩展基于名称中的模式的列.因此,这部分是出于好奇心,部分是出于娱乐目的.
This is an extension of Update pairs of columns based on pattern in their names . Thus, this is partially motivated by curiosity and partially for entertainment.
在开发该问题的答案时,我突然想到这可能是 for
循环比 *apply
函数更有效的情况之一(并且我一直在寻找一个很好的例子来说明 *apply
不一定比结构良好的 for
循环更有效").所以我想再次提出这个问题,并询问是否有人能够使用 *apply
函数(或 purr
,如果那是你的东西)比我在下面写的 for
循环更好.性能将根据我的笔记本电脑(运行 R 3.3.2 的廉价 Windows 机器)上的 microbenchmark
评估的执行时间来判断.
While developing an answer to that question, it occurred to me that this may be one of those cases where a for
loop is more efficient than an *apply
function (and I've been looking for a good illustration of the fact that *apply
is not necessarily "more efficient" than a well constructed for
loop). So I'd like to pose the question again, and ask if anyone is able to write a solution using an *apply
function (or purr
if that's your thing) that performs better than the for
loop I've written below. Performance will be judged on execution time as evaluated via microbenchmark
on my laptop (A cheap Windows box running R 3.3.2).
data.table
和 dplyr
建议也是受欢迎的.(我已经在计划如何利用我节省的所有微秒).
data.table
and dplyr
suggestions are welcome as well. (I'm already making plans for what I'll do with all the microseconds I save).
考虑数据框:
col_1 <- c(1,2,NA,4,5)
temp_col_1 <-c(12,2,2,3,4)
col_2 <- c(1,23,423,NA,23)
temp_col_2 <-c(1,2,23,4,5)
df_test <- data.frame(col_1, temp_col_1, col_2, temp_col_2)
set.seed(pi)
df_test <- df_test[sample(1:nrow(df_test), 1000, replace = TRUE), ]
对于每个col_x
,用temp_col_x
中的对应值替换缺失值.因此,例如:
For each col_x
, replace the missing values with the corresponding value in temp_col_x
. So, for example:
col_1 temp_col_1 col_2 temp_col_2
1 1 12 1 1
2 2 2 23 2
3 NA 2 423 23
4 4 3 NA 4
5 5 4 23 5
变成
col_1 temp_col_1 col_2 temp_col_2
1 1 12 1 1
2 2 2 23 2
3 2 2 423 23
4 4 3 4 4
5 5 4 23 5
现有解决方案
我已经写好的for
循环
temp_cols <- names(df_test)[grepl("^temp", names(df_test))]
cols <- sub("^temp_", "", temp_cols)
for (i in seq_along(temp_cols)){
row_to_replace <- which(is.na(df_test[[cols[i]]]))
df_test[[cols[i]]][row_to_replace] <- df_test[[temp_cols[i]]][row_to_replace]
}
到目前为止我最好的 apply
功能是:
My best apply
function so far is:
lapply(names(df_test)[grepl("^temp_", names(df_test))],
function(tc){
col <- sub("^temp_", "", tc)
row_to_replace <- which(is.na(df_test[[col]]))
df_test[[col]][row_to_replace] <<- df_test[[tc]][row_to_replace]
})
基准测试
随着(如果)建议的出现,我将开始在对此问题的编辑中显示基准.(代码现在是 Frank 答案的副本,但按照承诺在我的机器上运行 100 次)
Benchmarking
As (if) suggestions come in, I will begin showing benchmarks in edits to this question. (edit: code is now a copy of Frank's answer, but run 100 times on my machine, as promised)
library(magrittr)
library(data.table)
library(microbenchmark)
set.seed(pi)
nc = 1e3
nr = 1e2
df_m0 = sample(c(1:10, NA_integer_), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
df_r = sample(c(1:10), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
microbenchmark(times = 100,
for_vec = {
df_m <- df_m0
for (col in 1:nc){
w <- which(is.na(df_m[[col]]))
df_m[[col]][w] <- df_r[[col]][w]
}
}, lapply_vec = {
df_m <- df_m0
lapply(seq_along(df_m),
function(i){
w <- which(is.na(df_m[[i]]))
df_m[[i]][w] <<- df_r[[i]][w]
})
}, for_df = {
df_m <- df_m0
for (col in 1:nc){
w <- which(is.na(df_m[[col]]))
df_m[w, col] <- df_r[w, col]
}
}, lapply_df = {
df_m <- df_m0
lapply(seq_along(df_m),
function(i){
w <- which(is.na(df_m[[i]]))
df_m[w, i] <<- df_r[w, i]
})
}, mat = { # in lmo's answer
df_m <- df_m0
bah = is.na(df_m)
df_m[bah] = df_r[bah]
}, set = {
df_m <- copy(df_m0)
for (col in 1:nc){
w = which(is.na(df_m[[col]]))
set(df_m, i = w, j = col, v = df_r[w, col])
}
}
)
结果:
Unit: milliseconds
expr min lq mean median uq max neval cld
for_vec 135.83875 157.84548 175.23005 166.60090 176.81839 502.0616 100 b
lapply_vec 135.67322 158.99496 179.53474 165.11883 178.06968 551.7709 100 b
for_df 173.95971 204.16368 222.30677 212.76608 224.78188 446.6050 100 c
lapply_df 181.46248 205.57069 220.38911 215.08505 223.98406 381.1006 100 c
mat 129.27835 154.01248 173.11378 159.83070 169.67439 453.0888 100 b
set 66.86402 81.08138 86.32626 85.51029 89.58331 123.1926 100 a
推荐答案
Data.table 提供了 set
函数来通过引用修改 data.tables 或 data.frames.
Data.table provides the set
function to modify data.tables or data.frames by reference.
这是一个在列数和行数方面更灵活的基准测试,它避开了 OP 中尴尬的列名内容:
Here's a benchmark that is more flexible with respect to numbers of cols and rows and that sidesteps the awkward column-name stuff in the OP:
library(magrittr)
nc = 1e3
nr = 1e2
df_m0 = sample(c(1:10, NA_integer_), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
df_r = sample(c(1:10), nc*nr, replace = TRUE) %>% matrix(nr, nc) %>% data.frame
library(data.table)
library(microbenchmark)
microbenchmark(times = 10,
for_vec = {
df_m <- df_m0
for (col in 1:nc){
w <- which(is.na(df_m[[col]]))
df_m[[col]][w] <- df_r[[col]][w]
}
}, lapply_vec = {
df_m <- df_m0
lapply(seq_along(df_m), function(i){
w <- which(is.na(df_m[[i]]))
df_m[[i]][w] <<- df_r[[i]][w]
})
}, for_df = {
df_m <- df_m0
for (col in 1:nc){
w <- which(is.na(df_m[[col]]))
df_m[w, col] <- df_r[w, col]
}
}, lapply_df = {
df_m <- df_m0
lapply(seq_along(df_m), function(i){
w <- which(is.na(df_m[[i]]))
df_m[w, i] <<- df_r[w, i]
})
}, mat = { # in lmo's answer
df_m <- df_m0
bah = is.na(df_m)
df_m[bah] = df_r[bah]
}, set = {
df_m <- copy(df_m0)
for (col in 1:nc){
w = which(is.na(df_m[[col]]))
set(df_m, i = w, j = col, v = df_r[w, col])
}
}
)
这给...
Unit: milliseconds
expr min lq mean median uq max neval
for_vec 77.06501 89.53430 100.10051 96.33764 106.13486 142.1329 10
lapply_vec 77.67366 89.04438 98.81510 99.08863 108.86491 117.2956 10
for_df 103.79097 130.33134 140.95398 144.46526 157.11335 161.4507 10
lapply_df 97.04616 114.17825 126.10633 131.20382 137.64375 149.7765 10
mat 73.47691 84.51473 100.16745 103.44476 112.58006 128.6166 10
set 44.32578 49.58586 62.52712 56.30460 71.63432 101.3517 10
评论:
如果我们调整
nc
和nr
或者NA
的频率,这四个选项的排名可能会发生变化.我猜列数越多,mat
方式(来自@lmo 的回答)和set
方式看起来就越好.
If we adjust
nc
andnr
or the frequency ofNA
s, the ranking of these four options might change. I guess the more cols there are, the better themat
way (from @lmo's answer) andset
way look.
set
测试中的 copy
比我们在实践中看到的要花费一些额外的时间,因为 set
函数只是通过引用修改表格(我认为与其他选项不同).
The copy
in the set
test takes some extra time beyond what we'd see in practice, since the set
function just modifies the table by reference (unlike the other options, I think).
这篇关于优化数据框中的替换的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!