Problem description
To start, here is the example data I'm working with:
ID  BaselineScore  MidScore  FinalScore
1   x              NA        NA
1   NA             y         NA
1   NA             NA        z
2   a              NA        NA
2   NA             b         NA
2   NA             NA        c
What I'd like to accomplish is, for a given ID (ID==1, ID==2, etc.), to determine which of the three scores (baseline, mid, or final) is greatest (i.e. max(x,y,z), max(a,b,c), etc.). The reason I have NAs is that I used the spread function from tidyr (the score variables at each time point were originally rows under a more general score variable).
I tried using the base R pmax function, but that only works if the values being compared are 'horizontally' aligned, i.e. sit in the same row across columns. Here each ID's scores are spread over different rows.
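To illustrate the limitation, here is a minimal sketch. The numbers 1, 2, 3 and 7, 6, 5 are hypothetical stand-ins for x, y, z and a, b, c, and the column name FinalScore is an assumption. pmax compares element by element, so it returns one maximum per row rather than one per ID; a further grouping step would be needed to collapse the rows.

```r
# Hypothetical numeric stand-ins for the placeholder scores in the table above
df <- data.frame(
  ID            = c(1, 1, 1, 2, 2, 2),
  BaselineScore = c(1, NA, NA, 7, NA, NA),
  MidScore      = c(NA, 2, NA, NA, 6, NA),
  FinalScore    = c(NA, NA, 3, NA, NA, 5)
)

# pmax works row by row: each row holds a single non-NA score,
# so the "maximum" of a row is just that one score
row_max <- pmax(df$BaselineScore, df$MidScore, df$FinalScore, na.rm = TRUE)
print(row_max)

# Collapsing per ID needs an extra grouping step, e.g. tapply;
# this yields the largest value per ID, but not which score it came from
id_max <- tapply(row_max, df$ID, max)
print(id_max)
```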
Any tips?
Thanks,
Recommended answer
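Here is a base R solution: split the data by ID, use apply with max to get each score column's maximum, and then use which.max to find the index of the largest one. A dplyr/tidyr version is included for a timing comparison.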
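library(dplyr)           # %>%, group_by, filter
library(tidyr)           # gather
library(microbenchmark)  # microbenchmark

df <- read.csv(text="ID,BaselineScore,MidScore,FinalScore
1,1,NA,NA
1,NA,2,NA
1,NA,NA,3
2,7,NA,NA
2,NA,6,NA
2,NA,NA,5")

# Base R: for each ID, take every score column's max (ignoring NA),
# then keep the largest one together with its column name.
fun_base <- function() {
  lapply(split(df, df$ID), function(x) {
    tmp <- apply(x[-1], 2, max, na.rm=TRUE)
    tmp[which.max(tmp)]
  })
}

# dplyr/tidyr: reshape to long form, then keep the row(s) per ID
# whose Score equals that ID's maximum.
fun_dplyr <- function() {
  df %>%
    gather(Score_type, Score, -ID) %>%
    group_by(ID) %>%
    filter(Score==max(Score, na.rm=TRUE))
}

microbenchmark(
  fun_base(),
  fun_dplyr(),
  times=50L)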
#Unit: microseconds
# expr min lq mean median uq max neval
# fun_base() 590.6 666.6 728.842 709.85 789.1 1060.1 50
# fun_dplyr() 2110.3 2318.3 2533.324 2442.75 2639.5 3663.4 50
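If a single data frame is more convenient than a list, the same idea can be sketched with aggregate. This is only one possible variant, not part of the benchmark above; the column name FinalScore and the new columns BestScore and BestValue are assumptions introduced for illustration. Note that if an ID has no non-NA value at all in some column, max(..., na.rm = TRUE) returns -Inf with a warning.

```r
# Same sample values as in the answer; FinalScore is an assumed column name
df <- data.frame(
  ID            = c(1, 1, 1, 2, 2, 2),
  BaselineScore = c(1, NA, NA, 7, NA, NA),
  MidScore      = c(NA, 2, NA, NA, 6, NA),
  FinalScore    = c(NA, NA, 3, NA, NA, 5)
)
score_cols <- c("BaselineScore", "MidScore", "FinalScore")

# Collapse to one row per ID by taking each column's max, ignoring NA
collapsed <- aggregate(df[score_cols], by = list(ID = df$ID),
                       FUN = max, na.rm = TRUE)

# For each ID, find the position of the largest score and look up its name
best_idx <- apply(collapsed[score_cols], 1, which.max)
collapsed$BestScore <- score_cols[best_idx]
collapsed$BestValue <- apply(collapsed[score_cols], 1, max)

print(collapsed)
```

With the sample data, ID 1 resolves to FinalScore and ID 2 to BaselineScore.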