Problem description
I was unhappy with the time dplyr and data.table were taking to create a new variable on my data.frame, so I decided to compare methods.
To my surprise, reassigning the result of dplyr::mutate() to a new data.frame seems to be faster than not doing so.
Why is that?
library(data.table)
library(tidyverse)
dt <- fread(".... data.csv") #load 200MB datafile
dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
# data.table: add the column by reference
a <- Sys.time()
dt1[, MONTH := month(as.Date(DATE))]
b <- Sys.time(); datatabletook <- b - a
# dplyr, assigning the result to a new object
c <- Sys.time()
dt_dplyr <- dt2 %>%
  mutate(MONTH = month(as.Date(DATE)))
d <- Sys.time(); dplyr_reassign_took <- d - c
# dplyr, without assigning the result
e <- Sys.time()
dt3 %>%
  mutate(MONTH = month(as.Date(DATE)))
f <- Sys.time(); dplyrtook <- f - e
datatabletook = 17sec
dplyrtook = 47sec
dplyr_reassign_took = 17sec
Recommended answer
There are a couple of ways to time code:
.t0 <- Sys.time()
...
.t1 <- Sys.time()
.t1 - .t0
# or
system.time({
  ...
})
With the Sys.time way, you're sending each line to the console and may see some return value printed for each line, as @Axeman suggested. With {...}, there is only one return value (the last result inside the braces), and system.time will suppress it from printing.
If the printing is costly enough but is not part of what you want to measure, it can make a noticeable difference to the timing.
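Whether a top-level expression gets printed can be checked with base R's withVisible(). The following is a minimal sketch on the built-in mtcars data rather than the 200MB file from the question; the names bare, assigned and timing are made up for illustration.

```r
# A bare expression at top level is "visible": the console auto-prints it.
bare <- withVisible(head(mtcars, 3))

# An assignment returns its value invisibly, so nothing is printed.
assigned <- withVisible(x <- head(mtcars, 3))

bare$visible      # TRUE
assigned$visible  # FALSE

# system.time() returns the timing, not the value of the expression,
# so the (possibly large) result inside the braces is never printed.
timing <- system.time({
  head(mtcars, 3)
})
class(timing)     # "proc_time"
```

On a large table, the visible case is the one that pays for printing, which would match the slower unassigned mutate() timing in the question.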
There are good reasons to prefer system.time over Sys.time for benchmarking; from @MattDowle's comment:
ii) it includes user and sys time as well as elapsed wall clock time.
The Sys.time() way will be affected by reading your email in Chrome or using Excel while the test runs; the system.time() way won't, so long as you use the user and sys parts of the result.
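As a rough illustration of the difference between the elapsed and the user/sys components, the sketch below times a call to Sys.sleep(), which waits without using the CPU. This is only a stand-in, chosen here for illustration, for time lost while other programs are busy; the names tm, elapsed and cpu_time are hypothetical.

```r
# Sys.sleep() waits without doing work, so wall-clock time passes
# while almost no CPU time is charged to this R process.
tm <- system.time(Sys.sleep(0.3))

elapsed  <- tm[["elapsed"]]                       # wall-clock time, includes the wait
cpu_time <- tm[["user.self"]] + tm[["sys.self"]]  # CPU time used by this R process

print(tm)
cat("elapsed:", elapsed, " user + sys:", cpu_time, "\n")
```

The elapsed figure includes the full wait, while user plus sys stays close to zero. That is why the user and sys parts are the ones to read when other activity on the machine may interfere.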