Problem description
I have a data table in R:
library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
x y v
[1,] 1 A 12
[2,] 1 B 62
[3,] 1 A 60
[4,] 1 B 61
[5,] 2 A 83
[6,] 2 B 97
[7,] 2 A 1
[8,] 2 B 22
[9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49
I can easily sum the variable v by the groups in the data.table:
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
x y SUM
[1,] 1 A 72
[2,] 1 B 123
[3,] 2 A 84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B 96
However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape:
out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
x SUM.A SUM.B
[1,] 1 72 123
[2,] 2 84 119
[3,] 3 162 96
Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?
Solution
The data.table package implements faster melt/dcast functions (in C). It also adds features that allow melting and casting multiple columns. Please see the new vignette "Efficient reshaping using data.tables" on GitHub.
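For the question as asked, the aggregation and the reshape can be folded into a single dcast() call. The following is a minimal sketch, not the only way to do it. It rebuilds the question's DT with the v values typed in by hand (an assumption made so the result does not depend on the random number generator of the installed R version), and the name wide is illustrative.

```r
library(data.table)

# Same layout as the question's DT, with the printed v values entered
# directly instead of drawn with sample().
DT <- data.table(
  x = rep(c(1, 2, 3), each = 4),
  y = rep(c("A", "B"), times = 6),
  v = c(12, 62, 60, 61, 83, 97, 1, 22, 99, 47, 63, 49)
)

# One row per x, one column per level of y; each cell holds sum(v)
# for that (x, y) group, so no separate aggregation step is needed.
wide <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
wide
```

The result has columns x, A and B rather than SUM.A and SUM.B; setnames() can rename them if the reshape() naming is wanted.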
melt/dcast functions for data.table have been available since v1.9.0 and the features include:
- There is no need to load the reshape2 package prior to casting. But if you want it loaded for other operations, please load it before loading data.table.
- dcast is also an S3 generic. No more dcast.data.table(). Just use dcast().
- melt:
  - is capable of melting on columns of type 'list'.
  - gains variable.factor and value.factor, which by default are TRUE and FALSE respectively, for compatibility with reshape2. This allows for directly controlling the output type of the variable and value columns (as factors or not).
  - melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.
  - NEW: melt can accept a list for measure.vars, and columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See vignette or ?melt.
- dcast:
  - accepts multiple fun.aggregate and multiple value.var. See vignette or ?dcast.
  - use the rowid() function directly in the formula to generate an id-column, which is sometimes required to identify the rows uniquely. See ?dcast.
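The melt and dcast features listed above can be sketched roughly as follows. The kids table and its column names (dob_1, sex_1, and so on) are hypothetical and exist only for illustration, and DT repeats the question's data with the v values typed in by hand. Treat this as a sketch under those assumptions; ?melt and ?dcast remain the authoritative reference.

```r
library(data.table)

# Hypothetical wide table: two column families (dob_*, sex_*) per id.
kids <- data.table(
  id    = 1:3,
  dob_1 = c("1998-11-26", "1996-06-22", "2002-07-11"),
  dob_2 = c("2000-01-29", NA, "2004-04-05"),
  sex_1 = c("F", "M", "F"),
  sex_2 = c("M", NA, "F")
)

# patterns() builds the measure.vars list: the columns matching each
# regex are stacked into one value column, named through value.name.
# na.rm = TRUE drops the missing second child of id 2 while melting.
long <- melt(kids, id.vars = "id",
             measure.vars = patterns("^dob_", "^sex_"),
             value.name = c("dob", "sex"),
             na.rm = TRUE)

# Several value.var columns can be cast back to wide in one call.
back <- dcast(long, id ~ variable, value.var = c("dob", "sex"))

# The question's data, values entered by hand.
DT <- data.table(
  x = rep(c(1, 2, 3), each = 4),
  y = rep(c("A", "B"), times = 6),
  v = c(12, 62, 60, 61, 83, 97, 1, 22, 99, 47, 63, 49)
)

# Several aggregation functions in one cast: sum and mean of v
# for every (x, y) combination.
agg <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = list(sum, mean))

# rowid() inside the formula numbers the repeats within each (x, y)
# group, so every original row is identified and nothing is aggregated.
per_row <- dcast(DT, x + rowid(x, y) ~ y, value.var = "v")
```

Without the rowid() term, the last call would find two values per cell and fall back to counting them, which is why an explicit id column is sometimes needed.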
Old benchmarks:
- melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
- dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
Reminder of Cologne (Dec 2013) presentation slide 32: Why not submit a dcast pull request to reshape2?