R data.table: apply function A to some columns and function B to other columns

Question

I want to aggregate the rows of a data.table, but the aggregation function depends on the name of the column.

For example, if the column name is:

  • variable1 or variable2, then apply the mean() function.
  • variable3, then apply the max() function.
  • variable4, then apply the sd() function.

My data.tables always have a datetime column: I want to aggregate rows by time. However, the number of "data" columns can vary.

I know how to do that with the same aggregation function (e.g. mean()) for all columns:

dt <- dt[, lapply(.SD, mean),
           by = .(datetime = floor_date(datetime, timeStep))]

Or for a subset of the columns only:

cols <- c("variable1", "variable2")
dt <- dt[ ,(cols) := lapply(.SD, mean),
            by = .(datetime = floor_date(datetime, timeStep)),
            .SDcols = cols]

What I would like to do is something like this:

colsToMean <- c("variable1", "variable2")
colsToMax <- c("variable3")
colsToSd <- c("variable4")
dt <- dt[ ,{(colsToMean) := lapply(.SD???, mean),
             (colsToMax) := lapply(.SD???, max),
             (colsToSd) :=  lapply(.SD???, sd)},
            by = .(datetime = floor_date(datetime, timeStep)),
            .SDcols = c(colsToMean, colsToMax, colsToSd)]

I looked at "data.table in R - apply multiple functions to multiple columns", which gave me the idea to use a custom function:

myAggregate <- function(x, columnName) {
   FUN <- getAggregateFunction(columnName) # Returns mean, max or sd
   FUN(x)
}
dt <- dt[, lapply(.SD, myAggregate, ???columnName???),
           by = .(datetime = floor_date(datetime, timeStep))]

But I don't know how to pass the current column name to myAggregate()...
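For reference, one way the column name could reach a custom function is to iterate over `.SD` and `names(.SD)` in parallel with `Map()`. This is only a minimal sketch under assumptions: the lookup list `aggFuns`, the toy data, and the value of `timeStep` are all made up for illustration, and `myAggregate()` is a hypothetical stand-in for the helper described above.

```r
library(data.table)
library(lubridate)

# Hypothetical lookup table: column name -> aggregation function
aggFuns <- list(variable1 = mean, variable2 = mean,
                variable3 = max,  variable4 = sd)

# Hypothetical helper: picks the function from the column name
myAggregate <- function(x, columnName) {
    FUN <- aggFuns[[columnName]]
    FUN(x)
}

# Toy data (assumption): one row every 10 minutes for 4 hours
set.seed(42)
dt <- data.table(
    datetime  = as.POSIXct("2024-01-01 00:00:00", tz = "UTC") +
                seq(0, by = 600, length.out = 24),
    variable1 = rnorm(24),
    variable2 = rnorm(24),
    variable3 = rnorm(24),
    variable4 = rnorm(24)
)
timeStep <- "1 hour"  # assumed aggregation step

# Map() walks over the columns of .SD and their names together
res <- dt[, Map(myAggregate, .SD, names(.SD)),
          by = .(datetime = floor_date(datetime, timeStep)),
          .SDcols = names(aggFuns)]
```

Because `.SD` is the first argument given to `Map()`, the result keeps the original column names.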

Recommended answer

Here is an option using Map or mapply.

Let's make some toy data first:

library(data.table)

dt <- data.table(
    variable1 = rnorm(100),
    variable2 = rnorm(100),
    variable3 = rnorm(100),
    variable4 = rnorm(100),
    grp = sample(letters[1:5], 100, replace = TRUE)
)

colsToMean <- c("variable1", "variable2")
colsToMax <- c("variable3")
colsToSd <- c("variable4")

Then,

scols <- list(colsToMean, colsToMax, colsToSd)
funs <- rep(c(mean, max, sd), lengths(scols))

# summary
dt[, Map(function(f, x) f(x), funs, .SD), by = grp, .SDcols = unlist(scols)]

# or replace the original values with summary statistics as in OP
dt[, unlist(scols) := Map(function(f, x) f(x), funs, .SD), by = grp, .SDcols = unlist(scols)]
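One detail worth noting: `Map()` takes result names from its first argument, and `funs` is an unnamed list, so the summary form above is expected to yield default column names (`V1`, `V2`, ...). A small variation, sketched here with its own toy data and an arbitrary seed (both assumptions), puts `.SD` first so the original column names carry over:

```r
library(data.table)

set.seed(1)  # arbitrary seed so the toy data is reproducible
dt <- data.table(
    variable1 = rnorm(100),
    variable2 = rnorm(100),
    variable3 = rnorm(100),
    variable4 = rnorm(100),
    grp = sample(letters[1:5], 100, replace = TRUE)
)

scols <- list(c("variable1", "variable2"), "variable3", "variable4")
funs  <- rep(c(mean, max, sd), lengths(scols))

# .SD comes first, so Map() copies its column names onto the result
res <- dt[, Map(function(x, f) f(x), .SD, funs),
          by = grp, .SDcols = unlist(scols)]
names(res)
```

The computation is the same as in the summary line above; only the argument order, and therefore the naming, differs.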

Another option with GForce on:

scols <- list(colsToMean, colsToMax, colsToSd)
funs <- rep(c('mean', 'max', 'sd'), lengths(scols))

jexp <- paste0('list(', paste0(funs, '(', unlist(scols), ')', collapse = ', '), ')')
dt[, eval(parse(text = jexp)), by = grp, verbose = TRUE]

# Detected that j uses these columns: variable1,variable2,variable3,variable4
# Finding groups using forderv ... 0.000sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
# Getting back original order ... 0.000sec
# lapply optimization is on, j unchanged as 'list(mean(variable1), mean(variable2), max(variable3), sd(variable4))'
# GForce optimized j to 'list(gmean(variable1), gmean(variable2), gmax(variable3), gsd(variable4))'
# Making each group and running j (GForce TRUE) ... 0.000sec
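The same `j` expression can also be assembled from language objects instead of pasted text, which avoids `parse()`. The following is a hedged sketch, not part of the answer above: the toy data and seed are assumptions, and whether GForce is triggered for this form is not claimed here (passing `verbose = TRUE` would show it).

```r
library(data.table)

set.seed(1)  # arbitrary seed so the toy data is reproducible
dt <- data.table(
    variable1 = rnorm(100),
    variable2 = rnorm(100),
    variable3 = rnorm(100),
    variable4 = rnorm(100),
    grp = sample(letters[1:5], 100, replace = TRUE)
)

scols <- list(c("variable1", "variable2"), "variable3", "variable4")
cols  <- unlist(scols)
funs  <- rep(c("mean", "max", "sd"), lengths(scols))

# Build one unevaluated call per column, e.g. mean(variable1)
calls <- Map(function(f, col) call(f, as.name(col)), funs, cols)
names(calls) <- cols  # name the outputs after the input columns

# Wrap them as list(variable1 = mean(variable1), ...)
jexp <- as.call(c(quote(list), calls))

res <- dt[, eval(jexp), by = grp]
```

Naming the calls after the input columns means the result columns are `variable1` to `variable4` rather than `V1` to `V4`.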


08-18 15:01