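I am looking for a way to summarize large experimental results in R. The summary is not straightforward, because I need to summarize an arbitrary number of columns (which cannot be hard-coded in advance) using arbitrarily defined summary functions.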
For example, I have the following flat table my_table:
> my_table
   id_1 id_2 rep_id value_1 value_2
1     a    1      1     0.0     0.0
2     a    1      2     0.2     0.2
3     a    1      3     0.3     0.3
4     a    1      4     0.4     0.4
5     a    1      5     0.1     0.1
6     a    2      1     0.5     0.0
7     a    2      2     1.5     1.5
8     a    2      3     2.5     2.5
9     a    2      4     3.5     3.5
10    a    2      5     4.5     4.5
I would like to summarize my_table into a table such as:
> summary_table
  id_1 id_2 value_1.min value_1.max value_1.mean_plus_sd value_2.min value_2.max value_2.mean_plus_sd
1    a    1         0.0         0.4            0.3581139           0         0.4            0.3581139
2    a    2         0.5         4.5            4.0811388           0         4.5            4.1464249
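The summary is complicated because I want to:
group by key columns that are specified in a vector, e.g.
key_fields = c("id_1","id_2")
summarize an arbitrary set of columns that is also specified in a vector, e.g.
fields_to_summarize = c("value_1","value_2")
apply my own summary functions, such as the mean plus the standard deviation.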
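Here is the code I currently use to do all three of these things. It works, but it is also very inefficient. Any improvements would be greatly appreciated: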
library(plyr)
# create table
my_table = data.frame("id_1" = c("a","a","a","a","a","a","a","a","a","a")
,"id_2" = c("1","1","1","1","1","2","2","2","2","2")
,"rep_id" = c(1,2,3,4,5,1,2,3,4,5)
,"value_1"= c(0.0,0.2,0.3,0.4,0.1,0.5,1.5,2.5,3.5,4.5)
,"value_2"= c(0.0,0.2,0.3,0.4,0.1,0.0,1.5,2.5,3.5,4.5)
)
# specify columns to group by / summarize over
key_fields = c("id_1","id_2")
fields_to_summarize = c("value_1","value_2")
# create summary_table
counter = 1
for (fname in fields_to_summarize){
  # build min / max / mean + sd for the current column, named "<column>.<stat>"
  summary_function = function(D) data.frame(setNames(list(min(D[[fname]]),
                                                          max(D[[fname]]),
                                                          mean(D[[fname]]) + sd(D[[fname]])),
                                                     paste(fname, c("min", "max", "mean_plus_sd"), sep=".")))
  # apply the summary to every group defined by key_fields
  tmp = ddply(.data = my_table,
              .variables = key_fields,
              .fun = summary_function)
  # the first column starts the result; later columns are joined on the keys
  if (counter == 1){
    summary_table = tmp
  } else {
    summary_table = join(x=summary_table, y=tmp, by=key_fields, type="left", match="all")
  }
  counter = counter + 1
}
Best answer
Not a final solution, but perhaps a good start with dplyr:
library(dplyr)
# custom summary: mean plus standard deviation
mean_plus_sd <- function(x) mean(x) + sd(x)
key_fields = c("id_1","id_2")
fields_to_summarize = c("value_1","value_2")
my_table %>%
  group_by_(.dots = key_fields) %>%
  summarise_each_(funs(min,max,mean_plus_sd), fields_to_summarize)
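The underscore verbs (group_by_, summarise_each_) and funs() used above were deprecated in later dplyr releases. As a rough sketch only, assuming dplyr 1.0.0 or newer is installed, the same idea might be written with across() and all_of(). The name summary_funs is introduced here purely for illustration and is not part of the original answer; the table is rebuilt inside the block so that it runs on its own.

```r
library(dplyr)

# rebuild the example table from the question so the block is self-contained
my_table <- data.frame(
  id_1    = rep("a", 10),
  id_2    = rep(c("1", "2"), each = 5),
  rep_id  = rep(1:5, times = 2),
  value_1 = c(0.0, 0.2, 0.3, 0.4, 0.1, 0.5, 1.5, 2.5, 3.5, 4.5),
  value_2 = c(0.0, 0.2, 0.3, 0.4, 0.1, 0.0, 1.5, 2.5, 3.5, 4.5)
)

# custom summary: mean plus standard deviation
mean_plus_sd <- function(x) mean(x) + sd(x)

# columns to group by / summarize, kept in vectors rather than hard-coded
key_fields <- c("id_1", "id_2")
fields_to_summarize <- c("value_1", "value_2")

# hypothetical helper name: a named list of the summary functions to apply
summary_funs <- list(min = min, max = max, mean_plus_sd = mean_plus_sd)

summary_table <- my_table %>%
  group_by(across(all_of(key_fields))) %>%
  summarise(
    # apply every function to every column, naming results "<column>.<function>"
    across(all_of(fields_to_summarize), summary_funs, .names = "{.col}.{.fn}"),
    .groups = "drop"
  )

print(summary_table)
```

Adding another statistic should then only require a new entry in summary_funs, and adding another column only a new entry in fields_to_summarize; no loop or join is needed.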
Regarding r - summarizing an arbitrary number of columns in a data frame with my own functions in R, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27770736/