Must ddply use all possible combinations of the splitting variables, or only the observed ones?

Problem description

I have a data frame called thetas containing about 2.7 million observations.

> str(thetas)
'data.frame':   2700000 obs. of  8 variables:
 $ rho_cnd   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ pct_cnd   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ sx        : num  1 2 3 4 5 6 7 8 9 10 ...
 $ model     : Factor w/ 7 levels "dN.mN","dN.mL",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ estTheta  : num  -1.58 -1.716 0.504 -2.296 0.98 ...
 $ trueTheta : num  0.0962 -3.3913 3.6006 -0.1971 2.1906 ...
 $ estError  : num  -1.68 1.68 -3.1 -2.1 -1.21 ...
 $ trueAberSx: num  0 0 0 0 0 0 0 0 0 0 ...

I would like to use ddply, or some similar function, to sum the error of estimation (the column estError in my data frame), but where the sums are within each condition of my simulation. The problem is, I don't have a simple way to combine values from the other columns of this data frame to uniquely identify all those conditions. To be more specific: the column model contains 7 possible values. Three of these possible values are only matched up with one possible value in each of rho_cnd and pct_cnd, while the other four possible values of model are matched up with 6 possible pairings of values in rho_cnd and pct_cnd.

The obvious solution, I know, would be to go back and make a variable that uniquely identifies all the conditions that I would need to identify here, so that the following code would work:

> sums <- ddply(thetas, .(condition1, condition2), summarise, sumError = sum(estError))  # condition1, condition2, etc. are placeholder column names

But I just don't want to go back and recreate how this data frame is built. Right now I have two data frames created with two separate calls to expand.grid that are then rbinded and sorted to create a data frame listing all valid conditions, but even if I kept those few lines of code in I'm not sure how to reference them with ddply. I would rather not even use this solution, but I will if necessary.

> conditions
   models rhos pcts
1   dN.mN  0.0 0.00
2   dN.mL  0.0 0.00
3   dN.mH  0.0 0.00
4   dL.mN  0.1 0.01
12  dL.mN  0.1 0.02
20  dL.mN  0.1 0.10
8   dL.mN  0.2 0.01
16  dL.mN  0.2 0.02
24  dL.mN  0.2 0.10
5   dL.mL  0.1 0.01
13  dL.mL  0.1 0.02
21  dL.mL  0.1 0.10
9   dL.mL  0.2 0.01
17  dL.mL  0.2 0.02
25  dL.mL  0.2 0.10
6   dH.mN  0.1 0.01
14  dH.mN  0.1 0.02
22  dH.mN  0.1 0.10
10  dH.mN  0.2 0.01
18  dH.mN  0.2 0.02
26  dH.mN  0.2 0.10
7   dH.mH  0.1 0.01
15  dH.mH  0.1 0.02
23  dH.mH  0.1 0.10
11  dH.mH  0.2 0.01
19  dH.mH  0.2 0.02
27  dH.mH  0.2 0.10

Any advice for better code and/or more efficiency? Thanks!

Recommended answer

I agree with the comment that ddply(thetas, .(model, rho_cnd, pct_cnd), ...) should work. If certain combinations of those variables never occur in the data, .drop = TRUE (ddply's default) ensures that the unobserved combinations are left out of the result.
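
As a minimal sketch of that behaviour, the toy data frame below stands in for thetas (its values and the column name sumError are made up for illustration). It assumes the plyr package is installed. Only three of the possible combinations of the grouping values occur, and only those three come back:

```r
library(plyr)

# Toy stand-in for thetas: the models are not fully crossed with rho_cnd / pct_cnd
toy <- data.frame(
  model    = factor(c("dN.mN", "dN.mN", "dL.mN", "dL.mN", "dL.mN", "dL.mN")),
  rho_cnd  = c(0, 0, 0.1, 0.1, 0.2, 0.2),
  pct_cnd  = c(0, 0, 0.01, 0.01, 0.01, 0.01),
  estError = c(1, 2, 3, 4, 5, 6)
)

# One row per observed (model, rho_cnd, pct_cnd) combination; .drop = TRUE is the default
sums <- ddply(toy, .(model, rho_cnd, pct_cnd), summarise,
              sumError = sum(estError))
print(sums)
```

Passing summarise as the function lets the sum be written as an expression over the columns of each piece, which is what the sum(estError) in the question was reaching for.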

However, if you wanted to avoid ddply looking through some of the nonexistent combinations, you could try something like the following:

#newCond <- apply(thetas[,c("model", "rho_cnd", "pct_cnd")], 1, paste, collapse="_")
# sep has to go inside the argument list handed to do.call(), not to do.call() itself
newCond <- do.call(paste, c(thetas[,c("model", "rho_cnd", "pct_cnd")], sep="_")) #as suggested by baptiste
thetas2 <- cbind(thetas, newCond)

I admit, the above code might run slowly for you, so I'm not sure it's what you want. But from there you should be able to use ddply() with .variables = "newCond" (or, equivalently, .(newCond)).
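
A small base-R sketch of building such a key, using a made-up three-row data frame in place of thetas. The names toy, key_cols and newCond2 are illustrative only. It also shows interaction() with drop = TRUE as one possible alternative that yields a factor with levels only for the observed combinations:

```r
# Toy stand-in for thetas with three observed conditions
toy <- data.frame(
  model   = factor(c("dN.mN", "dL.mN", "dL.mN")),
  rho_cnd = c(0, 0.1, 0.2),
  pct_cnd = c(0, 0.01, 0.01)
)
key_cols <- c("model", "rho_cnd", "pct_cnd")

# Vectorised paste over the columns; sep is appended to the argument list
toy$newCond <- do.call(paste, c(toy[, key_cols], sep = "_"))

# Alternative: a factor whose unused level combinations are dropped
toy$newCond2 <- interaction(toy[, key_cols], drop = TRUE)

print(toy)
```

Either column can then serve as a single grouping variable; whether this is faster than grouping on the three original columns would need to be timed on the real data.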

Furthermore, because you're returning only a single number for each subset of the data, you could just use aggregate, if you wanted.

sums <- aggregate(estError ~ newCond, data=thetas2, FUN=sum) # formula interface; by= would need a list, and sum (not colSums) is applied to each group's vector
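
If the pasted key is not needed for anything else, aggregate can probably also group on the three original columns directly. The sketch below uses a made-up toy data frame in place of thetas. It relies on aggregate's formula interface, which returns rows only for the combinations that actually occur:

```r
# Toy stand-in for thetas: three observed conditions, two rows each
toy <- data.frame(
  model    = factor(c("dN.mN", "dN.mN", "dL.mN", "dL.mN", "dL.mN", "dL.mN")),
  rho_cnd  = c(0, 0, 0.1, 0.1, 0.2, 0.2),
  pct_cnd  = c(0, 0, 0.01, 0.01, 0.01, 0.01),
  estError = c(1, 2, 3, 4, 5, 6)
)

# Sum estError within each observed (model, rho_cnd, pct_cnd) combination
sums <- aggregate(estError ~ model + rho_cnd + pct_cnd, data = toy, FUN = sum)
print(sums)
```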

I hope this helps.
