问题描述
在我看来,处理 data.table
的行/列子集的最快方法是使用join和 nomatch
选项。
It seems to me the fastest way to do a row/col subset of a data.table
is to use the join and nomatch
option.
DT = data.table(rep(1:100, 100000), rep(1:10, 1000000))
setkey(DT, V1, V2)
system.time(DT[J(22,2), nomatch=0L])
# user system elapsed
# 0.00 0.00 0.01
system.time(subset(DT, (V1==22) & (V2==2)))
# user system elapsed
# 0.45 0.21 0.67
identical(DT[J(22,2), nomatch=0L],subset(DT, (V1==22) & (V2==2)))
# [1] TRUE
基于二进制搜索的快速连接也有一个问题:我找不到一种方法来选择一维中的所有项目。
I also have one problem with the fast join based on binary search: I cannot find a way to select all items in one dimension.
请问是否要随后执行此操作:
Say if I want to subsequently do:
DT[J(22,2), nomatch=0] # subset on TWO dimensions
DT[J(22,), nomatch=0] # subset on ONE dimension only
# Error in list(22, ) : argument 2 is empty
而不必将密钥重新设置为仅一维(因为我处于循环状态,并且我不想每次都放下密钥)。 / p>
without having to re-set the key to only one dimension (because I am in a loop and I don't want to rest the keys every time).
推荐答案
子集 data.table
的最快方法是什么?
使用基于二进制搜索的子集功能最快。请注意,该子集需要选项 nomatch = 0L
,以便仅返回匹配结果。
What's the fastest way to subset a data.table
?
Using the binary search based subset feature is the fastest. Note that the subset requires the option nomatch = 0L
so as to return only the matching results.
如果您在 DT
上设置了两个键,并且想要通过第一个键 子集,那么您只需在 J(。)
中提供第一个值,否需要提供第二个密钥的任何内容。也就是说:
If you've two keys set on DT
and you want to subset by the first key, then you can just provide the first value in J(.)
, no need to provide anything for the 2nd key. That is:
# will return all columns where the first key column matches 22
DT[J(22), nomatch=0L]
如果相反,您想通过第二个键进行子集,那么到目前为止,您必须为第一个键提供所有唯一值。也就是说:
If instead, you would like to subset by the second key, then you'll have to, as of now, provide all the unique values for the first key. That is:
# will return all columns where 2nd key column matches 2
DT[J(unique(V1), 2), nomatch=0L]
这也显示为。尽管我希望 DT [J(,2)]
可以用于这种情况,因为这似乎很直观。
This is also shown in this SO post. Although I'd prefer that DT[J(, 2)]
to work for this case, as that seems rather intuitive.
还有一个待处理的功能请求,用于实现辅助密钥,完成后将解决此问题。
There's also a pending feature request, FR #1007 for implementing secondary keys, which when done would take care of this.
下面是一个更好的示例:
Here is a better example:
DT = data.table(c(1,2,3,4,5), c(2,3,2,3,2))
DT
# V1 V2
# 1: 1 2
# 2: 2 3
# 3: 3 2
# 4: 4 3
# 5: 5 2
setkey(DT,V1,V2)
DT[J(unique(V1),2)]
# V1 V2
# 1: 1 2
# 2: 2 2
# 3: 3 2
# 4: 4 2
# 5: 5 2
DT[J(unique(V1),2), nomatch=0L]
# V1 V2
# 1: 1 2
# 2: 3 2
# 3: 5 2
DT[J(3), nomatch=0L]
# V1 V2
# 1: 3 2
总结:
# key(DT) = c("V1", "V2")
# data.frame | data.table equivalent
# =====================================================================
# subset(DF, (V1 == 3) & (V2 == 2)) | DT[J(3,2), nomatch=0L]
# subset(DF, (V1 == 3)) | DT[J(3), nomatch=0L]
# subset(DF, (V2 == 2)) | DT[J(unique(V1), 2), nomatch=0L]
这篇关于子集数据表的最快方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!