Problem description
Say we have a data frame with three columns representing three different cases, each of which can be in state 0 or 1. A fourth column contains a measurement.
set.seed(123)
df <- data.frame(round(runif(25)),
round(runif(25)),
round(runif(25)),
runif(25))
colnames(df) <- c("V1", "V2", "V3", "x")
head(df)
V1 V2 V3 x
1 0 1 0 0.2201189
2 1 1 0 0.3798165
3 0 1 1 0.6127710
aggregate(df$x, by=list(df$V1, df$V2, df$V3), FUN=mean)
Group.1 Group.2 Group.3 x
1 0 0 0 0.1028646
2 1 0 0 0.5081943
3 0 1 0 0.4828984
4 1 1 0 0.5197925
5 0 0 1 0.4571073
6 1 0 1 0.3219217
7 0 1 1 0.6127710
8 1 1 1 0.6029213
The aggregate function calculates the mean for all possible combinations. However, in my research I also need to know the outcome of combinations in which certain columns may have any state. For example, the mean of all observations with V1==1 & V2==1, regardless of the contents of V3. The result should look like this, with the asterisk representing "don't care":
Group.1 Group.2 Group.3 x
1 * * * 0.1234567 (this is the mean of all rows)
2 0 * * 0.1234567
3 1 * * 0.1234567
4 * 0 * 0.1224567
5 * 1 * 0.1234567
[ all other possible combinations follow, should be total of 27 rows ]
Is there an easy way to achieve this?
Recommended answer
Here is an ldply-ddply approach:
library(plyr)
ldply(list(.(V1,V2,V3),.(V1),.(V2),.()), function(y) ddply(df,y,summarise,x=mean(x)))
V1 V2 V3 x .id
1 0 0 0 0.1028646 <NA>
2 0 0 1 0.4571073 <NA>
3 0 1 0 0.4828984 <NA>
4 0 1 1 0.6127710 <NA>
5 1 0 0 0.5081943 <NA>
6 1 0 1 0.3219217 <NA>
7 1 1 0 0.5197925 <NA>
8 1 1 1 0.6029213 <NA>
9 0 NA NA 0.4436400 <NA>
10 1 NA NA 0.4639997 <NA>
11 NA 0 NA 0.4118793 <NA>
12 NA 1 NA 0.5362985 <NA>
13 NA NA NA 0.4566702 <NA>
Essentially, you create a list of all the variable combinations you are interested in, iterate over it with ldply, and use ddply to perform the aggregation. plyr then combines everything into one compact data frame for you. All that remains is to remove the spurious .id column introduced by the grand mean (.()) and, if needed, to replace the NAs in the grouping columns with "*".
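The cleanup step described here might look like the following minimal sketch. It uses only base R and, to stay self-contained, works on a few rows copied from the ldply output shown earlier; the names res and group_cols are illustrative and not part of the original answer.

```r
# A few rows copied from the ldply() output shown earlier (rows 9 to 13).
res <- data.frame(
  V1  = c(0, 1, NA, NA, NA),
  V2  = c(NA, NA, 0, 1, NA),
  V3  = c(NA, NA, NA, NA, NA),
  x   = c(0.4436400, 0.4639997, 0.4118793, 0.5362985, 0.4566702),
  .id = NA
)

# Drop the spurious .id column (a no-op if the column does not exist).
res$.id <- NULL

# Replace NA in the grouping columns with "*" ("don't care").
# Note: this turns the grouping columns into character vectors.
group_cols <- c("V1", "V2", "V3")
res[group_cols] <- lapply(
  res[group_cols],
  function(col) ifelse(is.na(col), "*", as.character(col))
)

print(res)
```

Because the grouping columns become character after this step, it is probably best applied last, once no further numeric filtering on V1, V2, or V3 is needed.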
To get all combinations, you can use combn and lapply to generate a list of the relevant combinations to plug into ldply:
all.combs <- unlist(lapply(0:3,combn,x=c("V1","V2","V3"),simplify=FALSE),recursive=FALSE)
ldply(all.combs, function(y) ddply(df,y,summarise,x=mean(x)))
.id x V1 V2 V3
1 <NA> 0.4566702 NA NA NA
2 <NA> 0.4436400 0 NA NA
3 <NA> 0.4639997 1 NA NA
4 <NA> 0.4118793 NA 0 NA
5 <NA> 0.5362985 NA 1 NA
6 <NA> 0.4738541 NA NA 0
7 <NA> 0.4380543 NA NA 1
8 <NA> 0.3862588 0 0 NA
9 <NA> 0.5153666 0 1 NA
10 <NA> 0.4235250 1 0 NA
11 <NA> 0.5530440 1 1 NA
12 <NA> 0.3878900 0 NA 0
13 <NA> 0.4882400 0 NA 1
14 <NA> 0.5120604 1 NA 0
15 <NA> 0.4022073 1 NA 1
16 <NA> 0.4502901 NA 0 0
17 <NA> 0.3820042 NA 0 1
18 <NA> 0.5013455 NA 1 0
19 <NA> 0.6062045 NA 1 1
20 <NA> 0.1028646 0 0 0
21 <NA> 0.4571073 0 0 1
22 <NA> 0.4828984 0 1 0
23 <NA> 0.6127710 0 1 1
24 <NA> 0.5081943 1 0 0
25 <NA> 0.3219217 1 0 1
26 <NA> 0.5197925 1 1 0
27 <NA> 0.6029213 1 1 1
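To see why the result has 27 rows, it can help to inspect the list of column subsets directly. The sketch below is base R only; it adds the empty subset (the grand mean) explicitly rather than relying on how combn treats m = 0, which is an assumption made for robustness and differs slightly from the one-liner in the answer. The names vars, subsets, labels, and max_rows are illustrative.

```r
vars <- c("V1", "V2", "V3")

# Subsets of size 1..3 via combn(); the empty subset (grand mean) is added
# by hand so the sketch does not depend on combn()'s behaviour for m = 0.
subsets <- c(
  list(character(0)),
  unlist(lapply(seq_along(vars), combn, x = vars, simplify = FALSE),
         recursive = FALSE)
)

# One label per subset, e.g. "V1,V3"; the empty subset means "all rows".
labels <- vapply(
  subsets,
  function(s) if (length(s)) paste(s, collapse = ",") else "(all rows)",
  character(1)
)
print(labels)

# A subset of k binary columns yields at most 2^k groups:
# 1 + 3 * 2 + 3 * 4 + 8 = 27
max_rows <- sum(2^lengths(subsets))
print(max_rows)
```

The count of 27 is an upper bound: if some combination of states never occurs in the data, that group is simply absent from the aggregated result.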