问题描述
首先:感谢@MattDowle;data.table
是最好的东西之一自从我开始使用 R
以来,我从未遇到过.
First of all: thanks to @MattDowle; data.table
is among the best things thatever happened to me since I started using R
.
第二:我知道变量列的各种用例有很多变通方法data.table
中的名称,包括:
Second: I am aware of many workarounds for various use cases of variable columnnames in data.table
, including:
- 选择/分配给data.table变量的名称是存储在字符向量中
- 使用变量在data.table中传递列名在R
- 按保存的名称引用 data.table 列在变量中
- 以编程方式将列名传递给 data.table
- Data.table 元编程
- 如何写调用data.table的函数的函数?
- 在`data.table`中使用动态列名
- data.table,R中的动态列名李>
- 在 data.table 中使用 := 分配多个列,按组
- 在分组依据"中设置列名用data.table操作
- R 用 data.table 汇总多列
可能还有更多我没有参考的内容.
and probably more I haven't referenced.
但是:即使我学会了上面记录的所有技巧,以至于我从来不用查找它们来提醒自己如何使用它们,我仍然会发现使用作为参数传递给函数的列名是一项极其繁琐的任务.
But: even if I learned all the tricks documented above to the point that Inever had to look them up to remind myself how to use them, I still would findthat working with column names that are passed as parameters to a function isan extremely tedious task.
我正在寻找的是经过最佳实践批准"的替代方案到以下解决方法/工作流程.考虑我有一堆类似数据的列,并且想对这些列或它们的集合执行一系列类似的操作,其中操作具有任意高的复杂性,并且列名组传递给指定的每个操作一个变量.
What I'm looking for is a "best-practices-approved" alternativeto the following workaround / workflow. Considerthat I have a bunch of columns of similar data, and would like to perform a sequence of similar operations on these columns or sets of them, where the operations are of arbitrarily high complexity, and the groups of column names passed to each operation specified in a variable.
我意识到这个问题听起来是人为的,但我却以惊人的频率遇到它.这些示例通常非常混乱,很难区分出与此问题相关的功能,但我最近偶然发现了一个相当简单的示例,可以在此处用作 MWE:
I realize this issue sounds contrived, but I run into it with surprising frequency. The examples are usually so messy that it is difficult to separate out the features relevant to this question, but I recently stumbled across one that was fairly straightforward to simplify for use as a MWE here:
library(data.table)
library(lubridate)
library(zoo)
the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
var3=var1/floor(runif(6,2,5)))]
# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
length.out=12,
by="1 month")),by=year]
# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")
for(varname in var.names) {
#As suggested in an answer to Link 3 above
#Convert the column name to a 'quote' object
quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))
#Do this for every column name I'll need
varname <- quote.convert(varname)
anntot <- quote.convert(paste0(varname,".annual.total"))
monthly <- quote.convert(paste0(varname,".monthly"))
rolling <- quote.convert(paste0(varname,".rolling"))
scaled <- quote.convert(paste0(varname,".scaled"))
#Perform the relevant tasks, using eval()
#around every variable columnname I may want
new.table[,eval(anntot):=
the.table[,rep(eval(varname),each=12)]]
new.table[,eval(monthly):=
the.table[,rep(eval(varname)/12,each=12)]]
new.table[,eval(rolling):=
rollapply(eval(monthly),mean,width=12,
fill=c(head(eval(monthly),1),
tail(eval(monthly),1)))]
new.table[,eval(scaled):=
eval(anntot)/sum(eval(rolling))*eval(rolling),
by=year]
}
当然,这里对数据和变量的特定影响是无关紧要的,因此请不要关注它或建议改进以完成它在这种特殊情况下所完成的工作.相反,我正在寻找的是一种通用策略,用于重复将任意复杂的 data.table
操作过程应用于列列表或列列表列表的工作流程,在变量或作为参数传递给函数,其中过程必须以编程方式引用变量/参数中命名的列,并且可能包括更新、连接、分组、对 data.table
特殊对象的调用.I
、.SD
等;但是一种比上面的或其他需要频繁 quote
-ing 和 eval
-ing 的方法更简单、更优雅、更短或更容易设计、实现或理解的方法.
Of course, the particular effect on the data and variables here is irrelevant, so please do not focus on it or suggest improvements to accomplishing what it accomplishes in this particular case. What I am looking for, rather, is a generic strategy for the workflow of repeatedly applying an arbitrarily complicated procedure of data.table
actions to a list of columns or list of lists-of-columns, specified in a variable or passed as an argument to a function, where the procedure must refer programmatically to columns named in the variable/argument, and possibly includes updates, joins, groupings, calls to the data.table
special objects .I
, .SD
, etc.; BUT one which is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent quote
-ing and eval
-ing.
请特别注意,由于过程可能相当复杂并且涉及重复更新 data.table
然后引用更新的列,标准 lapply(.SD,...), ... .SDcols = ...
方法通常不是可行的替代品.同样用 DT[[a.column.name]]
替换 eval(a.column.name)
的每个调用既不会简化很多,也不会完全正常工作,因为那样不会'据我所知,不能很好地处理其他 data.table
操作.
In particular please note that because the procedures can be fairly complex and involve repeatedly updating the data.table
and then referencing the updated columns, the standard lapply(.SD,...), ... .SDcols = ...
approach is usually not a workable substitute. Also replacing each call of eval(a.column.name)
with DT[[a.column.name]]
neither simplifies much nor works completely in general since that doesn't play nice with the other data.table
operations, as far as I am aware.
推荐答案
您描述的问题与 data.table
没有严格关系.
复杂的查询不容易被翻译成机器可以解析的代码,因此我们无法避免为复杂的操作编写查询的复杂性.
您可以尝试想象如何使用 dplyr
或 SQL 以编程方式为以下 data.table
查询构造查询:
Problem you are describing is not strictly related to data.table
.
Complex queries cannot be easily translated to code that machine can parse, thus we are not able to escape complexity in writing a query for complex operations.
You can try to imagine how to programmatically construct a query for the following data.table
query using dplyr
or SQL:
DT[, c(f1(v1, v2, opt=TRUE),
f2(v3, v4, v5, opt1=FALSE, opt2=TRUE),
lapply(.SD, f3, opt1=TRUE, opt2=FALSE))
, by=.(id1, id2)]
假设所有列(id1
、id2
、v1
...v5
)甚至选项(opt
, opt1
, opt2
) 应该作为变量传递.
Assuming that all columns (id1
, id2
, v1
...v5
) or even options (opt
, opt1
, opt2
) should be passed as variables.
由于查询表达的复杂性,我认为您无法轻松完成问题中所述的要求:
Because of complexity in expression of queries I don't think you could easily accomplish requirement stated in your question:
比上面的或其他需要频繁 quote
-ing 和 eval
-ing 的方法更简单、更优雅、更短或更容易设计、实现或理解.
尽管与其他编程语言相比,base R 提供了非常有用的工具来处理此类问题.
Although, comparing to other programming languages, base R provides very useful tools to deal with such problems.
您已经找到了使用 get
、mget
、DT[[col_name]]
、parse
的建议,quote
, eval
.
You already found suggestions to use get
, mget
, DT[[col_name]]
, parse
, quote
, eval
.
- 正如您所提到的,
DT[[col_name]]
可能无法很好地与data.table
优化配合使用,因此在这里没有多大用处. parse
可能是构建复杂查询的最简单方法,因为您可以对字符串进行操作,但它不提供基本的语言语法验证.因此,您最终可能会尝试解析 R 解析器不接受的字符串.此外,还有一个安全问题,如 2655#issuecomment-376781159一个>.get
/mget
是处理此类问题最常用的方法.get
和mget
在内部被[.data.table
捕获并转换为预期的列.因此,您假设您的任意复杂查询将能够被[.data.table
分解并正确输入预期的列.- 自从几年前您问过这个问题以来,最近推出了新功能 - dot-dot prefix.您使用点-点为变量名称添加前缀以引用当前 data.table 范围之外的变量.与您在文件系统中引用父目录类似.dot-dot 后面的内部结构与
get
非常相似,具有前缀的变量将在[.data.table
内去引用..在未来的版本中,点点前缀可能允许调用:
- As you mentioned
DT[[col_name]]
might not play well withdata.table
optimizations, thus is not that useful here. parse
is probably the easiest way to construct complex queries as you can just operate on strings, but it doesn't provide basic language syntax validation. So you can ended up trying to parse a string that R parser does not accept. Additionally there is a security concern as presented in 2655#issuecomment-376781159.get
/mget
are the ones most commonly suggested to deal with such problems.get
andmget
are internally catch by[.data.table
and translated to expected columns. So you are assuming your arbitrary complex query will be able to be decomposed by[.data.table
and expected columns properly inputted.- Since you asked this question few years back, the new feature - dot-dot prefix - is being rolled out in recently. You prefix variable name using dot-dot to refer to a variable outside of the scope of current data.table. Similarly as you refer parent directory in file system. Internals behind dot-dot will be quite similar to
get
, variables having prefix will be de-referenced inside of[.data.table
. . In future releases dot-dot prefix may allow calls like:
col1="a"; col2="b"; col3="g"; col4="x"; col5="y"
DT[..col4==..col5, .(s1=sum(..col1), s2=sum(..col2)), by=..col3]
我个人更喜欢
quote
和eval
.quote
和eval
几乎被解释为从头开始手写.此方法不依赖data.table
功能来管理对列的引用.我们可以期望所有优化的工作方式与您手动编写这些查询的方式相同.我发现它也更容易调试,因为您可以随时打印带引号的表达式来查看实际传递给data.table
查询的内容.此外,发生错误的空间更小.使用 R 语言对象构建复杂的查询有时很棘手,很容易将过程包装成函数,因此它可以应用于不同的用例并易于重用.需要注意的是,此方法独立于data.table
.它使用 R 语言结构.您可以在官方 R 语言定义中找到更多信息 在计算语言一章中.Personally I prefer
quote
andeval
instead.quote
andeval
is interpreted almost as written by hand from scratch. This method does not rely ondata.table
abilities to manage references to columns. We can expect all optimizations to work the same way as if you would write those queries by hand. I found it also easier to debug as at any point you can just print quoted expression to see what is actually passed todata.table
query. Additionally there is a less space for bugs to occur. Constructing complex queries using R language object is sometimes tricky, it is easy to wrap the procedure into function so it can be applied in different use cases and easily re-used. Important to note that this method is independent fromdata.table
. It uses R language constructs. You can find more information about that in official R Language Definition in Computing on the language chapter.还有什么?
- 我在 中提交了一个名为 macro 的新概念提案#1579.简而言之,它是
DT[eval(qi), eval(qj), eval(qby)]
的包装器,因此您仍然必须对 R 语言对象进行操作.欢迎您在此处发表评论. - 最近我在 PR#4304 中提出了另一种元编程接口方法.简而言之,它使用新参数
env
将基本 Rsubstitute
功能插入[.data.table
.
- I submitted proposal of a new concept called macro in #1579. In short it is a wrapper on
DT[eval(qi), eval(qj), eval(qby)]
so you still have to operate on R language objects. You are welcome to put your comment there. - Recently I proposed another approach for metaprogramming interface in PR#4304. In short it plugs base R
substitute
functionality into[.data.table
using new argumentenv
.
转到示例.下面我将展示两种解决方法.第一个将使用基本 R 元编程,第二个将对 PR 中提出的 data.table 使用元编程#4304(见上文).
Going to the example. Below I will show two ways to solve it. First one will use base R metaprogramming, second one will use metaprogramming for data.table proposed in PR#4304 (see above).
- 基于语言的 R 计算
我会将所有逻辑包装到
do_vars
函数中.调用do_vars(donot=TRUE)
将打印要在data.table
而不是eval
上计算的表达式.下面的代码应该在 OP 代码之后运行.I will wrap all logic into
do_vars
function. Callingdo_vars(donot=TRUE)
will print expressions to be computed ondata.table
instead ofeval
them. Below code should be run just after the OP code.expected = copy(new.table) new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year] do_vars = function(x, y, vars, donot=FALSE) { name.suffix = function(x, suffix) as.name(paste(x, suffix, sep=".")) do_var = function(var, x, y) { substitute({ x[, .anntot := y[, rep(.var, each=12)]] x[, .monthly := y[, rep(.var/12, each=12)]] x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1)))] x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year] }, list( .var=as.name(var), .anntot=name.suffix(var, "annual.total"), .monthly=name.suffix(var, "monthly"), .rolling=name.suffix(var, "rolling"), .scaled=name.suffix(var, "scaled") )) } ql = lapply(setNames(nm=vars), do_var, x, y) if (donot) return(ql) lapply(ql, eval.parent) invisible(x) } do_vars(new.table, the.table, c("var1","var2","var3")) all.equal(expected, new.table) #[1] TRUE
我们可以预览查询
do_vars(new.table, the.table, c("var1","var2","var3"), donot=TRUE) #$var1 #{ # x[, `:=`(var1.annual.total, y[, rep(var1, each = 12)])] # x[, `:=`(var1.monthly, y[, rep(var1/12, each = 12)])] # x[, `:=`(var1.rolling, rollapply(var1.monthly, mean, width = 12, # fill = c(head(var1.monthly, 1), tail(var1.monthly, 1))))] # x[, `:=`(var1.scaled, var1.annual.total/sum(var1.rolling) * # var1.rolling), by = year] #} # #$var2 #{ # x[, `:=`(var2.annual.total, y[, rep(var2, each = 12)])] # x[, `:=`(var2.monthly, y[, rep(var2/12, each = 12)])] # x[, `:=`(var2.rolling, rollapply(var2.monthly, mean, width = 12, # fill = c(head(var2.monthly, 1), tail(var2.monthly, 1))))] # x[, `:=`(var2.scaled, var2.annual.total/sum(var2.rolling) * # var2.rolling), by = year] #} # #$var3 #{ # x[, `:=`(var3.annual.total, y[, rep(var3, each = 12)])] # x[, `:=`(var3.monthly, y[, rep(var3/12, each = 12)])] # x[, `:=`(var3.rolling, rollapply(var3.monthly, mean, width = 12, # fill = c(head(var3.monthly, 1), tail(var3.monthly, 1))))] # x[, `:=`(var3.scaled, var3.annual.total/sum(var3.rolling) * # var3.rolling), by = year] #} #
- 提议的 data.table 元编程
expected = copy(new.table) new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year] name.suffix = function(x, suffix) as.name(paste(x, suffix, sep=".")) do_var2 = function(var, x, y) { x[, .anntot := y[, rep(.var, each=12)], env = list( .anntot = name.suffix(var, "annual.total"), .var = var )] x[, .monthly := y[, rep(.var/12, each=12)], env = list( .monthly = name.suffix(var, "monthly"), .var = var )] x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1))), env = list( .rolling = name.suffix(var, "rolling"), .monthly = name.suffix(var, "monthly") )] x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year, env = list( .scaled = name.suffix(var, "scaled"), .anntot = name.suffix(var, "annual.total"), .rolling = name.suffix(var, "rolling") )] TRUE } sapply(setNames(nm=var.names), do_var2, new.table, the.table) #var1 var2 var3 #TRUE TRUE TRUE all.equal(expected, new.table) #[1] TRUE
数据和更新的 OP 代码
Data and updated OP code
library(data.table) library(lubridate) library(zoo) the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400))) the.table[,`:=`(var2=var1/floor(runif(6,2,5)), var3=var1/floor(runif(6,2,5)))] # Replicate data across months new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")),by=year] # Do a complicated procedure to each variable in some group. var.names <- c("var1","var2","var3") for(varname in var.names) { #As suggested in an answer to Link 3 above #Convert the column name to a 'quote' object quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')'))) #Do this for every column name I'll need varname <- quote.convert(varname) anntot <- quote.convert(paste0(varname,".annual.total")) monthly <- quote.convert(paste0(varname,".monthly")) rolling <- quote.convert(paste0(varname,".rolling")) scaled <- quote.convert(paste0(varname,".scaled")) #Perform the relevant tasks, using eval() #around every variable columnname I may want new.table[,paste0(varname,".annual.total"):= the.table[,rep(eval(varname),each=12)]] new.table[,paste0(varname,".monthly"):= the.table[,rep(eval(varname)/12,each=12)]] new.table[,paste0(varname,".rolling"):= rollapply(eval(monthly),mean,width=12, fill=c(head(eval(monthly),1), tail(eval(monthly),1)))] new.table[,paste0(varname,".scaled"):= eval(anntot)/sum(eval(rolling))*eval(rolling), by=year] }
这篇关于如何在 R 中的 data.table 中使用变量中的列名完全通用地工作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!
- 我在 中提交了一个名为 macro 的新概念提案#1579.简而言之,它是