Summing multiple columns by group with tapply

Question

I wanted to sum individual columns by group, and my first thought was to use tapply. However, I cannot get tapply to work. Can tapply be used to sum multiple columns? If not, why not?

I have searched the internet extensively and found numerous similar questions posted as far back as 2008. However, none of those questions has been answered directly. Instead, the responses invariably suggest using a different function.

Below is an example data set for which I wish to sum apples by state, cherries by state and plums by state. Below that I have compiled numerous alternatives to tapply that do work.

At the bottom I show a simple modification to the tapply source code that allows tapply to perform the desired operation.

Nevertheless, perhaps I am overlooking a simple way to perform the desired operation with tapply. I am not looking for alternative functions, although additional alternatives are welcome.

Given the simplicity of my modification to the tapply source code, I wonder why it, or something similar, has not already been implemented.

Thank you for any advice. If my question is a duplicate, I will be happy to post my question as an answer to that other question.

Here is the example data set:

df.1 <- read.table(text = '

    state   county   apples   cherries   plums
       AA        1        1          2       3
       AA        2       10         20      30
       AA        3      100        200     300
       BB        7       -1         -2      -3
       BB        8      -10        -20     -30
       BB        9     -100       -200    -300

', header = TRUE, stringsAsFactors = FALSE)

This does not work:

tapply(df.1, df.1$state, function(x) {colSums(x[,3:5])})

The help page says:

tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

X       an atomic object, typically a vector.

I was confused by the phrase "typically a vector", which made me wonder whether a data frame could be used. I have never been clear on what "atomic object" means.
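For what it is worth, both terms can be checked directly in R. The sketch below rebuilds the example data with a plain data.frame() call (an assumption made only for brevity; the values match df.1 above) and shows that a single column is an atomic vector, whereas a data frame is a list of columns, so length() counts columns rather than rows. That mismatch with the length of the grouping vector is, presumably, what the unmodified nx <- length(X) line quoted further down runs into.

```r
# Example data rebuilt with data.frame() for brevity (same values as df.1)
df.1 <- data.frame(
  state    = rep(c("AA", "BB"), each = 3),
  county   = c(1, 2, 3, 7, 8, 9),
  apples   = c(1, 10, 100, -1, -10, -100),
  cherries = c(2, 20, 200, -2, -20, -200),
  plums    = c(3, 30, 300, -3, -30, -300),
  stringsAsFactors = FALSE
)

col_is_atomic <- is.atomic(df.1$apples)  # a single column is an atomic vector
df_is_atomic  <- is.atomic(df.1)         # a data frame is not atomic ...
df_is_list    <- is.list(df.1)           # ... it is a list of columns

n_cols <- length(df.1)         # length() of a data frame counts columns
n_rows <- nrow(df.1)           # number of observations
n_idx  <- length(df.1$state)   # the grouping vector has one entry per row
```

Here n_cols differs from n_idx, while n_rows matches it, which is why a row count rather than length() is the quantity to compare against the index.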

Here are several alternatives to tapply that do work. The first alternative is a workaround that combines tapply with apply.

apply(df.1[,c(3:5)], 2, function(x) tapply(x, df.1$state, sum))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333

with(df.1, aggregate(df.1[,3:5], data.frame(state), sum))

#   state apples cherries plums
# 1    AA    111      222   333
# 2    BB   -111     -222  -333

t(sapply(split(df.1[,3:5], df.1$state), colSums))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333

t(sapply(split(df.1[,3:5], df.1$state), function(x) apply(x, 2, sum)))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333

aggregate(df.1[,3:5], by=list(df.1$state), sum)

#   Group.1 apples cherries plums
# 1      AA    111      222   333
# 2      BB   -111     -222  -333

by(df.1[,3:5], df.1$state, colSums)

# df.1$state: AA
#   apples cherries    plums
#      111      222      333
# ------------------------------------------------------------
# df.1$state: BB
#   apples cherries    plums
#     -111     -222     -333

with(df.1,
     aggregate(x = list(apples   = apples,
                        cherries = cherries,
                        plums    = plums),
               by = list(state   = state),
               FUN = function(x) sum(x)))

#   state apples cherries plums
# 1    AA    111      222   333
# 2    BB   -111     -222  -333

lapply(split(df.1, df.1$state), function(x) {colSums(x[,3:5])} )

# $AA
#   apples cherries    plums
#      111      222      333
#
# $BB
#   apples cherries    plums
#     -111     -222     -333
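Since additional alternatives are welcome, one more base R option may be worth listing: rowsum(), which sums the columns of a matrix or data frame within the levels of a grouping variable. The sketch below again rebuilds the data with data.frame() (an assumption for brevity, same values as df.1).

```r
# Example data rebuilt with data.frame() for brevity (same values as df.1)
df.1 <- data.frame(
  state    = rep(c("AA", "BB"), each = 3),
  county   = c(1, 2, 3, 7, 8, 9),
  apples   = c(1, 10, 100, -1, -10, -100),
  cherries = c(2, 20, 200, -2, -20, -200),
  plums    = c(3, 30, 300, -3, -30, -300),
  stringsAsFactors = FALSE
)

# Column sums of the three fruit columns within each state;
# the result has one row per group, labelled by the group value
sums <- rowsum(df.1[, 3:5], group = df.1$state)
sums
```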

Here is the source code for tapply, except that I changed the line:

nx <- length(X)

to:

nx <- ifelse(is.vector(X), length(X), dim(X)[1])

This modified version of tapply performs the desired operation:

my.tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
    FUN <- if (!is.null(FUN)) match.fun(FUN)
    if (!is.list(INDEX)) INDEX <- list(INDEX)
    nI <- length(INDEX)
    if (!nI) stop("'INDEX' is of length zero")
    namelist <- vector("list", nI)
    names(namelist) <- names(INDEX)
    extent <- integer(nI)
    nx     <- ifelse(is.vector(X), length(X), dim(X)[1])  # replaces nx <- length(X)
    one <- 1L
    group <- rep.int(one, nx) #- to contain the splitting vector
    ngroup <- one
    for (i in seq_along(INDEX)) {
        index <- as.factor(INDEX[[i]])
        if (length(index) != nx)
            stop("arguments must have same length")
        namelist[[i]] <- levels(index)  #- all of them, yes !
        extent[i] <- nlevels(index)
        group <- group + ngroup * (as.integer(index) - one)
        ngroup <- ngroup * nlevels(index)
    }
    if (is.null(FUN)) return(group)
    ans <- lapply(X = split(X, group), FUN = FUN, ...)
    index <- as.integer(names(ans))
    if (simplify && all(unlist(lapply(ans, length)) == 1L)) {
        ansmat <- array(dim = extent, dimnames = namelist)
        ans <- unlist(ans, recursive = FALSE)
    } else {
        ansmat <- array(vector("list", prod(extent)),
                        dim = extent, dimnames = namelist)
    }
    if(length(index)) {
        names(ans) <- NULL
        ansmat[index] <- ans
    }
    ansmat
}

my.tapply(df.1$apples, df.1$state, function(x) {sum(x)})

#  AA   BB
# 111 -111

my.tapply(df.1[,3:4] , df.1$state, function(x) {colSums(x)})

# $AA
#   apples cherries
#      111      222
#
# $BB
#   apples cherries
#     -111     -222
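The modification seems to work for two reasons: the row count now matches the length of the index, and split() divides a data frame by rows, so FUN receives one sub-data-frame per group. As a small sketch of both points (the data.frame() rebuild and the variable names are assumptions for illustration), note that base R's NROW() returns the length of a vector and the number of rows of a data frame, so it could stand in for the ifelse() expression:

```r
# Example data rebuilt with data.frame() for brevity (same values as df.1)
df.1 <- data.frame(
  state    = rep(c("AA", "BB"), each = 3),
  apples   = c(1, 10, 100, -1, -10, -100),
  cherries = c(2, 20, 200, -2, -20, -200),
  stringsAsFactors = FALSE
)

# NROW() treats a vector as a one-column matrix, so it covers both cases
nx_vec <- NROW(df.1$apples)                    # length of the vector
nx_df  <- NROW(df.1[, c("apples", "cherries")])  # number of rows

# split() on a data frame splits rows, giving one data frame per group
pieces <- split(df.1[, c("apples", "cherries")], df.1$state)
aa     <- colSums(pieces$AA)
```

Both nx_vec and nx_df equal the length of df.1$state, which is the check the loop over INDEX performs.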

Recommended answer

tapply works on a vector; for a data.frame you can use by (which is a wrapper for tapply; take a look at the code):

> by(df.1[,c(3:5)], df.1$state, FUN=colSums)
df.1$state: AA
  apples cherries    plums
     111      222      333
-------------------------------------------------------------------------------------
df.1$state: BB
  apples cherries    plums
    -111     -222     -333
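To illustrate the "wrapper" remark, the sketch below mimics roughly what by does for a data frame: it hands tapply a vector of row numbers and lets the function look up the corresponding rows. This is a simplified imitation under that assumption, not the actual by source; the data.frame() rebuild and the names res and sum_matrix are illustrative only.

```r
# Example data rebuilt with data.frame() for brevity (same values as df.1)
df.1 <- data.frame(
  state    = rep(c("AA", "BB"), each = 3),
  county   = c(1, 2, 3, 7, 8, 9),
  apples   = c(1, 10, 100, -1, -10, -100),
  cherries = c(2, 20, 200, -2, -20, -200),
  plums    = c(3, 30, 300, -3, -30, -300),
  stringsAsFactors = FALSE
)

# X is an atomic vector of row indices, so plain tapply accepts it;
# FUN uses the indices of each group to subset the data frame
res <- tapply(seq_len(nrow(df.1)), df.1$state,
              function(i) colSums(df.1[i, 3:5]))

aa <- res[[1]]   # column sums for the first group (state AA)

# Optionally bind the per-group results into one matrix, one row per state
sum_matrix <- do.call(rbind, res)
```

So tapply can produce the multi-column sums after all, provided X stays a vector and the data frame is reached through the row indices.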

