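I am looking for a way to summarize the results of a large experiment in R. The summarization is not straightforward because I need to summarize an arbitrary number of columns (which cannot be hard-coded in advance) using arbitrarily defined summary functions.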

For example, I have the following flat table my_table:

my_table
   id_1 id_2 rep_id value_1 value_2
1     a    1      1     0.0     0.0
2     a    1      2     0.2     0.2
3     a    1      3     0.3     0.3
4     a    1      4     0.4     0.4
5     a    1      5     0.1     0.1
6     a    2      1     0.5     0.0
7     a    2      2     1.5     1.5
8     a    2      3     2.5     2.5
9     a    2      4     3.5     3.5
10    a    2      5     4.5     4.5

I want to summarize my_table into a table such as:
> summary_table
  id_1 id_2 value_1.min value_1.max value_1.mean_plus_sd value_2.min value_2.max value_2.mean_plus_sd
1    a    1         0.0         0.4            0.3581139           0         0.4            0.3581139
2    a    2         0.5         4.5            4.0811388           0         4.5            4.1464249
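
For reference, the mean_plus_sd columns above are the group mean plus the sample standard deviation, as computed by R's sd(). A minimal check for value_1 in the first group (id_1 = "a", id_2 = "1"), with the five values copied from my_table:

```r
# value_1 for the group id_1 = "a", id_2 = "1" (rows 1-5 of my_table)
x <- c(0.0, 0.2, 0.3, 0.4, 0.1)

# mean(x) is 0.2; sd(x) uses the n - 1 denominator: sqrt(0.1 / 4)
mean_plus_sd <- mean(x) + sd(x)
print(mean_plus_sd)  # 0.3581139
```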

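The summarization is complicated because I want to:
  • specify the variables to group by, e.g. key_fields = c("id_1","id_2")
  • specify the columns to summarize, e.g. fields_to_summarize = c("value_1","value_2")
  • use my own summary functions (and also name the new columns)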

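Here is the code I currently use to do all three of these things. It works, but it is very inefficient. Any improvements would be greatly appreciated: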
    library(plyr)
    
    # create table
    my_table = data.frame("id_1"  = c("a","a","a","a","a","a","a","a","a","a")
                        ,"id_2" = c("1","1","1","1","1","2","2","2","2","2")
                        ,"rep_id" = c(1,2,3,4,5,1,2,3,4,5)
                        ,"value_1"= c(0.0,0.2,0.3,0.4,0.1,0.5,1.5,2.5,3.5,4.5)
                        ,"value_2"= c(0.0,0.2,0.3,0.4,0.1,0.0,1.5,2.5,3.5,4.5)
        )
    
    # specify columns to group by / summarize over
    key_fields = c("id_1","id_2")
    fields_to_summarize = c("value_1","value_2")
    
    # create summary_table
    counter = 1;
    for (fname in fields_to_summarize){
    
      summary_function = function(D) data.frame(setNames(list(min(D[[fname]]),
                                                              max(D[[fname]]),
                                                              mean(D[[fname]])+sd(D[[fname]])),
                                                         paste(fname,c("min",
                                                                       "max",
                                                                       "mean_plus_sd"),
                                                               sep=".")
      ))
    
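      # summarize the current column within each key_fields group
      # (the data frame is my_table; the original passed an undefined df)
      tmp = ddply(.data = my_table,
                  .variables = key_fields,
                  .fun = summary_function)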
    
      if (counter == 1){
        summary_table = tmp;
      } else {
        summary_table = join(x=summary_table,y=tmp,by=key_fields,type="left", match="all")
      }
      counter = counter + 1;
    }
    

Best answer

Not a final solution, but perhaps a good start with dplyr:

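    library(dplyr)
    
    # my_table and fields_to_summarize are the objects defined in the question
    mean_plus_sd <- function(x) mean(x) + sd(x)
    key_fields = c("id_1","id_2")
    
    # group by the key columns (given as strings), then apply every function to every column
    my_table %>%
      group_by_(.dots = key_fields) %>%
      summarise_each_(funs(min,max,mean_plus_sd), fields_to_summarize)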
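
The underscore verbs and funs() used above have since been deprecated in dplyr. The sketch below shows how the same idea might look with across(), assuming dplyr 1.0.0 or later is installed. It is not part of the original answer. The data and variable names are copied from the question, and the .names pattern is chosen here so that the output columns follow the value_1.min layout the question asks for.

```r
library(dplyr)

# Sample data copied from the question
my_table <- data.frame(
  id_1    = rep("a", 10),
  id_2    = rep(c("1", "2"), each = 5),
  rep_id  = rep(1:5, times = 2),
  value_1 = c(0.0, 0.2, 0.3, 0.4, 0.1, 0.5, 1.5, 2.5, 3.5, 4.5),
  value_2 = c(0.0, 0.2, 0.3, 0.4, 0.1, 0.0, 1.5, 2.5, 3.5, 4.5)
)

# Grouping and summary columns are plain character vectors,
# so nothing is hard-coded in the pipeline itself
key_fields <- c("id_1", "id_2")
fields_to_summarize <- c("value_1", "value_2")

# User-defined summary function
mean_plus_sd <- function(x) mean(x) + sd(x)

summary_table <- my_table %>%
  group_by(across(all_of(key_fields))) %>%
  summarise(
    across(
      all_of(fields_to_summarize),
      # the list names become the suffixes of the new columns
      list(min = min, max = max, mean_plus_sd = mean_plus_sd),
      .names = "{.col}.{.fn}"
    ),
    .groups = "drop"
  )

print(summary_table)
```

Because every column is summarized in a single grouped pass, this avoids the per-column ddply call and the repeated join in the question's loop. Adding another summary function only requires one more named entry in the list.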
    

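This question and answer come from a similar question on Stack Overflow, "r - Summarizing an arbitrary number of columns in a data frame using my own functions in R": https://stackoverflow.com/questions/27770736/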
