中使用变量中的列名完全通用地工作

中使用变量中的列名完全通用地工作

本文介绍了如何在 R 中的 data.table 中使用变量中的列名完全通用地工作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

首先:感谢@MattDowle;data.table 是最好的东西之一自从我开始使用 R 以来,我从未遇到过.

First of all: thanks to @MattDowle; data.table is among the best things thatever happened to me since I started using R.

第二:我知道变量列的各种用例有很多变通方法data.table 中的名称,包括:

Second: I am aware of many workarounds for various use cases of variable columnnames in data.table, including:

  1. 选择/分配给data.table变量的名称是存储在字符向量中
  2. 使用变量在data.table中传递列名在R
  3. 按保存的名称引用 data.table 列在变量中
  4. 以编程方式将列名传递给 data.table
  5. Data.table 元编程
  6. 如何写调用data.table的函数的函数?
  7. 在`data.table`中使用动态列名
  8. data.table,R中的动态列名
  9. 在 data.table 中使用 := 分配多个列,按组
  10. 在分组依据"中设置列名用data.table操作
  11. R 用 data.table 汇总多列

可能还有更多我没有参考的内容.

and probably more I haven't referenced.

但是:即使我学会了上面记录的所有技巧,以至于我从来不用查找它们来提醒自己如何使用它们,我仍然会发现使用作为参数传递给函数的列名是一项极其繁琐的任务.

But: even if I learned all the tricks documented above to the point that Inever had to look them up to remind myself how to use them, I still would findthat working with column names that are passed as parameters to a function isan extremely tedious task.

我正在寻找的是经过最佳实践批准"的替代方案到以下解决方法/工作流程.考虑我有一堆类似数据的列,并且想对这些列或它们的集合执行一系列类似的操作,其中操作具有任意高的复杂性,并且列名组传递给指定的每个操作一个变量.

What I'm looking for is a "best-practices-approved" alternativeto the following workaround / workflow. Considerthat I have a bunch of columns of similar data, and would like to perform a sequence of similar operations on these columns or sets of them, where the operations are of arbitrarily high complexity, and the groups of column names passed to each operation specified in a variable.

我意识到这个问题听起来是人为的,但我却以惊人的频率遇到它.这些示例通常非常混乱,很难区分出与此问题相关的功能,但我最近偶然发现了一个相当简单的示例,可以在此处用作 MWE:

I realize this issue sounds contrived, but I run into it with surprising frequency. The examples are usually so messy that it is difficult to separate out the features relevant to this question, but I recently stumbled across one that was fairly straightforward to simplify for use as a MWE here:

library(data.table)
library(lubridate)
library(zoo)

the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
                var3=var1/floor(runif(6,2,5)))]

# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                           length.out=12,
                                           by="1 month")),by=year]

# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")

for(varname in var.names) {
    #As suggested in an answer to Link 3 above
    #Convert the column name to a 'quote' object
    quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))

    #Do this for every column name I'll need
    varname <- quote.convert(varname)
    anntot <- quote.convert(paste0(varname,".annual.total"))
    monthly <- quote.convert(paste0(varname,".monthly"))
    rolling <- quote.convert(paste0(varname,".rolling"))
    scaled <- quote.convert(paste0(varname,".scaled"))

    #Perform the relevant tasks, using eval()
    #around every variable columnname I may want
    new.table[,eval(anntot):=
               the.table[,rep(eval(varname),each=12)]]
    new.table[,eval(monthly):=
               the.table[,rep(eval(varname)/12,each=12)]]
    new.table[,eval(rolling):=
               rollapply(eval(monthly),mean,width=12,
                         fill=c(head(eval(monthly),1),
                                tail(eval(monthly),1)))]
    new.table[,eval(scaled):=
               eval(anntot)/sum(eval(rolling))*eval(rolling),
              by=year]
}

当然,这里对数据和变量的特定影响是无关紧要的,因此请不要关注它或建议改进以完成它在这种特殊情况下所完成的工作.相反,我正在寻找的是一种通用策略,用于重复将任意复杂的 data.table 操作过程应用于列列表或列列表列表的工作流程,在变量或作为参数传递给函数,其中过程必须以编程方式引用变量/参数中命名的列,并且可能包括更新、连接、分组、对 data.table 特殊对象的调用.I.SD等;但是一种比上面的或其他需要频繁 quote-ing 和 eval-ing 的方法更简单、更优雅、更短或更容易设计、实现或理解的方法.

Of course, the particular effect on the data and variables here is irrelevant, so please do not focus on it or suggest improvements to accomplishing what it accomplishes in this particular case. What I am looking for, rather, is a generic strategy for the workflow of repeatedly applying an arbitrarily complicated procedure of data.table actions to a list of columns or list of lists-of-columns, specified in a variable or passed as an argument to a function, where the procedure must refer programmatically to columns named in the variable/argument, and possibly includes updates, joins, groupings, calls to the data.table special objects .I, .SD, etc.; BUT one which is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent quote-ing and eval-ing.

请特别注意,由于过程可能相当复杂并且涉及重复更新 data.table 然后引用更新的列,标准 lapply(.SD,...), ... .SDcols = ... 方法通常不是可行的替代品.同样用 DT[[a.column.name]] 替换 eval(a.column.name) 的每个调用既不会简化很多,也不会完全正常工作,因为那样不会'据我所知,不能很好地处理其他 data.table 操作.

In particular please note that because the procedures can be fairly complex and involve repeatedly updating the data.table and then referencing the updated columns, the standard lapply(.SD,...), ... .SDcols = ... approach is usually not a workable substitute. Also replacing each call of eval(a.column.name) with DT[[a.column.name]] neither simplifies much nor works completely in general since that doesn't play nice with the other data.table operations, as far as I am aware.

推荐答案

您描述的问题与 data.table 没有严格关系.
复杂的查询不容易被翻译成机器可以解析的代码,因此我们无法避免为复杂的操作编写查询的复杂性.
您可以尝试想象如何使用 dplyrSQL 以编程方式为以下 data.table 查询构造查询:

Problem you are describing is not strictly related to data.table.
Complex queries cannot be easily translated to code that machine can parse, thus we are not able to escape complexity in writing a query for complex operations.
You can try to imagine how to programmatically construct a query for the following data.table query using dplyr or SQL:

DT[, c(f1(v1, v2, opt=TRUE),
       f2(v3, v4, v5, opt1=FALSE, opt2=TRUE),
       lapply(.SD, f3, opt1=TRUE, opt2=FALSE))
   , by=.(id1, id2)]

假设所有列(id1id2v1...v5)甚至选项(opt, opt1, opt2) 应该作为变量传递.

Assuming that all columns (id1, id2, v1...v5) or even options (opt, opt1, opt2) should be passed as variables.

由于查询表达的复杂性,我认为您无法轻松完成问题中所述的要求:

Because of complexity in expression of queries I don't think you could easily accomplish requirement stated in your question:

比上面的或其他需要频繁 quote-ing 和 eval-ing 的方法更简单、更优雅、更短或更容易设计、实现或理解.

尽管与其他编程语言相比,base R 提供了非常有用的工具来处理此类问题.

Although, comparing to other programming languages, base R provides very useful tools to deal with such problems.

您已经找到了使用 getmgetDT[[col_name]]parse 的建议,quote, eval.

You already found suggestions to use get, mget, DT[[col_name]], parse, quote, eval.

col1="a"; col2="b"; col3="g"; col4="x"; col5="y"
DT[..col4==..col5, .(s1=sum(..col1), s2=sum(..col2)), by=..col3]

  • 我个人更喜欢 quoteeval.quoteeval 几乎被解释为从头开始手写.此方法不依赖 data.table 功能来管理对列的引用.我们可以期望所有优化的工作方式与您手动编写这些查询的方式相同.我发现它也更容易调试,因为您可以随时打印带引号的表达式来查看实际传递给 data.table 查询的内容.此外,发生错误的空间更小.使用 R 语言对象构建复杂的查询有时很棘手,很容易将过程包装成函数,因此它可以应用于不同的用例并易于重用.需要注意的是,此方法独立于 data.table.它使用 R 语言结构.您可以在官方 R 语言定义中找到更多信息计算语言一章中.

    • Personally I prefer quote and eval instead. quote and eval is interpreted almost as written by hand from scratch. This method does not rely on data.table abilities to manage references to columns. We can expect all optimizations to work the same way as if you would write those queries by hand. I found it also easier to debug as at any point you can just print quoted expression to see what is actually passed to data.table query. Additionally there is a less space for bugs to occur. Constructing complex queries using R language object is sometimes tricky, it is easy to wrap the procedure into function so it can be applied in different use cases and easily re-used. Important to note that this method is independent from data.table. It uses R language constructs. You can find more information about that in official R Language Definition in Computing on the language chapter.

      还有什么?

      • 我在 中提交了一个名为 macro 的新概念提案#1579.简而言之,它是 DT[eval(qi), eval(qj), eval(qby)] 的包装器,因此您仍然必须对 R 语言对象进行操作.欢迎您在此处发表评论.
      • 最近我在 PR#4304 中提出了另一种元编程接口方法.简而言之,它使用新参数 env 将基本 R substitute 功能插入 [.data.table.
      • I submitted proposal of a new concept called macro in #1579. In short it is a wrapper on DT[eval(qi), eval(qj), eval(qby)] so you still have to operate on R language objects. You are welcome to put your comment there.
      • Recently I proposed another approach for metaprogramming interface in PR#4304. In short it plugs base R substitute functionality into [.data.table using new argument env.

      转到示例.下面我将展示两种解决方法.第一个将使用基本 R 元编程,第二个将对 PR 中提出的 data.table 使用元编程#4304(见上文).

      Going to the example. Below I will show two ways to solve it. First one will use base R metaprogramming, second one will use metaprogramming for data.table proposed in PR#4304 (see above).

      • 基于语言的 R 计算

      我会将所有逻辑包装到 do_vars 函数中.调用 do_vars(donot=TRUE) 将打印要在 data.table 而不是 eval 上计算的表达式.下面的代码应该在 OP 代码之后运行.

      I will wrap all logic into do_vars function. Calling do_vars(donot=TRUE) will print expressions to be computed on data.table instead of eval them. Below code should be run just after the OP code.

      expected = copy(new.table)
      new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]
      
      do_vars = function(x, y, vars, donot=FALSE) {
        name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
        do_var = function(var, x, y) {
          substitute({
            x[, .anntot := y[, rep(.var, each=12)]]
            x[, .monthly := y[, rep(.var/12, each=12)]]
            x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1)))]
            x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year]
          }, list(
            .var=as.name(var),
            .anntot=name.suffix(var, "annual.total"),
            .monthly=name.suffix(var, "monthly"),
            .rolling=name.suffix(var, "rolling"),
            .scaled=name.suffix(var, "scaled")
          ))
        }
        ql = lapply(setNames(nm=vars), do_var, x, y)
        if (donot) return(ql)
        lapply(ql, eval.parent)
        invisible(x)
      }
      do_vars(new.table, the.table, c("var1","var2","var3"))
      all.equal(expected, new.table)
      #[1] TRUE
      

      我们可以预览查询

      do_vars(new.table, the.table, c("var1","var2","var3"), donot=TRUE)
      #$var1
      #{
      #    x[, `:=`(var1.annual.total, y[, rep(var1, each = 12)])]
      #    x[, `:=`(var1.monthly, y[, rep(var1/12, each = 12)])]
      #    x[, `:=`(var1.rolling, rollapply(var1.monthly, mean, width = 12,
      #        fill = c(head(var1.monthly, 1), tail(var1.monthly, 1))))]
      #    x[, `:=`(var1.scaled, var1.annual.total/sum(var1.rolling) *
      #        var1.rolling), by = year]
      #}
      #
      #$var2
      #{
      #    x[, `:=`(var2.annual.total, y[, rep(var2, each = 12)])]
      #    x[, `:=`(var2.monthly, y[, rep(var2/12, each = 12)])]
      #    x[, `:=`(var2.rolling, rollapply(var2.monthly, mean, width = 12,
      #        fill = c(head(var2.monthly, 1), tail(var2.monthly, 1))))]
      #    x[, `:=`(var2.scaled, var2.annual.total/sum(var2.rolling) *
      #        var2.rolling), by = year]
      #}
      #
      #$var3
      #{
      #    x[, `:=`(var3.annual.total, y[, rep(var3, each = 12)])]
      #    x[, `:=`(var3.monthly, y[, rep(var3/12, each = 12)])]
      #    x[, `:=`(var3.rolling, rollapply(var3.monthly, mean, width = 12,
      #        fill = c(head(var3.monthly, 1), tail(var3.monthly, 1))))]
      #    x[, `:=`(var3.scaled, var3.annual.total/sum(var3.rolling) *
      #        var3.rolling), by = year]
      #}
      #
      

      • 提议的 data.table 元编程
      • expected = copy(new.table)
        new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]
        
        name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
        do_var2 = function(var, x, y) {
          x[, .anntot := y[, rep(.var, each=12)],
            env = list(
              .anntot = name.suffix(var, "annual.total"),
              .var = var
            )]
          x[, .monthly := y[, rep(.var/12, each=12)],
            env = list(
              .monthly = name.suffix(var, "monthly"),
              .var = var
            )]
          x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1))),
            env = list(
              .rolling = name.suffix(var, "rolling"),
              .monthly = name.suffix(var, "monthly")
            )]
          x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year,
            env = list(
              .scaled = name.suffix(var, "scaled"),
              .anntot = name.suffix(var, "annual.total"),
              .rolling = name.suffix(var, "rolling")
            )]
          TRUE
        }
        
        sapply(setNames(nm=var.names), do_var2, new.table, the.table)
        #var1 var2 var3
        #TRUE TRUE TRUE
        all.equal(expected, new.table)
        #[1] TRUE
        


        数据和更新的 OP 代码


        Data and updated OP code

        library(data.table)
        library(lubridate)
        library(zoo)
        
        the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
        the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
                        var3=var1/floor(runif(6,2,5)))]
        
        # Replicate data across months
        new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                                   length.out=12,
                                                   by="1 month")),by=year]
        
        # Do a complicated procedure to each variable in some group.
        var.names <- c("var1","var2","var3")
        
        for(varname in var.names) {
          #As suggested in an answer to Link 3 above
          #Convert the column name to a 'quote' object
          quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))
        
          #Do this for every column name I'll need
          varname <- quote.convert(varname)
          anntot <- quote.convert(paste0(varname,".annual.total"))
          monthly <- quote.convert(paste0(varname,".monthly"))
          rolling <- quote.convert(paste0(varname,".rolling"))
          scaled <- quote.convert(paste0(varname,".scaled"))
        
          #Perform the relevant tasks, using eval()
          #around every variable columnname I may want
          new.table[,paste0(varname,".annual.total"):=
                      the.table[,rep(eval(varname),each=12)]]
          new.table[,paste0(varname,".monthly"):=
                      the.table[,rep(eval(varname)/12,each=12)]]
          new.table[,paste0(varname,".rolling"):=
                      rollapply(eval(monthly),mean,width=12,
                                fill=c(head(eval(monthly),1),
                                       tail(eval(monthly),1)))]
          new.table[,paste0(varname,".scaled"):=
                      eval(anntot)/sum(eval(rolling))*eval(rolling),
                    by=year]
        }
        

        这篇关于如何在 R 中的 data.table 中使用变量中的列名完全通用地工作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-13 16:25