Problem description
I have recently been working with much larger datasets and have started learning and migrating to data.table to improve the performance of aggregation/grouping. I have been unable to get certain expressions or functions to group as expected. Here is an example of a basic group-by operation that I am having trouble with.
library(data.table)
category <- rep(1:10, 10)
value <- rnorm(100)
df <- data.frame(category, value)
dt <- data.table(df)
Simply calculating the mean for each group by category works easily enough:
dt[,mean(value),by="category"]
category V1
1: 1 -0.67555478
2: 2 -0.50438413
3: 3 0.29093723
4: 4 -0.41684790
5: 5 0.33921764
6: 6 0.01970997
7: 7 -0.23684245
8: 8 -0.04280998
9: 9 0.01838804
10: 10 0.44295978
I run into problems if I try to use the scale function, or even a simple expression that subtracts the group mean from each value. The grouping appears to be ignored and the function/expression is applied to each row instead. Each of the following returns all 100 rows instead of one row for each of the 10 categories.
dt[,scale(value),by="category"]
dt[,value-mean(value),by="category"]
I thought recreating scale as a function that returns a numeric vector instead of a matrix might help.
zScore <- function(x) {
z=(x-mean(x,na.rm=TRUE))/sd(x,na.rm = TRUE)
return(z)
}
dt[,zScore(value),by="category"]
category V1
1: 1 -1.45114132
2: 1 -0.35304528
3: 1 -0.94075418
4: 1 1.44454416
5: 1 1.39448268
6: 1 0.55366652
....
97: 10 -0.43190602
98: 10 -0.25409244
99: 10 0.35496694
100: 10 0.57323480
category V1
This also returns the zScore function applied to all rows (N=100), apparently ignoring the grouping. What am I missing in order to get scale() or a custom function to use the grouping as it did above with mean()?
Recommended answer
You've clarified in the comments that you'd like the same behaviour as:
ddply(df,"category",transform, zscorebycategory=zScore(value))
which gives:
category value zscorebycategory
1 1 0.28860691 0.31565682
2 1 1.17473759 1.33282374
3 1 0.06395503 0.05778463
4 1 1.37825487 1.56643607
etc
The data.table option you gave gives:
category V1
1: 1 0.31565682
2: 1 1.33282374
3: 1 0.05778463
4: 1 1.56643607
etc
Which is exactly the same data. However, you'd also like to repeat the value column in your result, and rename the V1 variable to something more descriptive. data.table gives you the grouping variable in the result, along with the result of the expression you provide. So let's modify that to give the columns you'd like:
Your
dt[,zScore(value),by="category"]
becomes:
dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
The named items in the list become columns in the result.
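The result has 100 rows because zScore() returns one value per input row, whereas mean() returns a single value per group; the grouping itself is still applied. A minimal sketch to check this, assuming the zScore() function from the question (the seed and the names res and check are introduced here only for illustration):

```r
library(data.table)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
dt <- data.table(category = rep(1:10, 10), value = rnorm(100))

# Same definition as zScore() in the question, written compactly
zScore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# One z-score per input row, computed within each category
res <- dt[, list(value = value, zscorebycategory = zScore(value)), by = "category"]

# If the grouping is honoured, every category's z-scores have
# mean ~0 and standard deviation ~1
check <- res[, list(m = mean(zscorebycategory), s = sd(zscorebycategory)),
             by = "category"]
print(check)
```

Here res keeps all 100 rows, while check collapses to one row per category because mean() and sd() each return a single value per group.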
library(plyr)  # provides ddply()
plyr = data.table(ddply(df,"category",transform, zscorebycategory=zScore(value)))
dt = dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
identical(plyr, dt)
> TRUE
(Note that I converted your ddply data.frame result into a data.table so that the identical command works.)
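If the goal is only to add the z-score as an extra column, data.table's := operator is another option: it adds the column by reference and leaves the rows in their original order. The following is a sketch under the same assumptions as the question's data and zScore(); the column names z_custom and z_scale are hypothetical:

```r
library(data.table)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
dt <- data.table(category = rep(1:10, 10), value = rnorm(100))

zScore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Add the per-category z-score as a new column, modifying dt in place
dt[, z_custom := zScore(value), by = category]

# scale() returns a one-column matrix; as.vector() drops the matrix
# structure so the result can be stored as an ordinary column
dt[, z_scale := as.vector(scale(value)), by = category]

# Both columns should agree up to floating-point tolerance
same <- isTRUE(all.equal(dt$z_custom, dt$z_scale))
print(same)
```

This avoids restating value in a list() and keeps dt at 100 rows, with the new columns alongside the existing ones.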