Problem Description
I am selecting a subset of a data.frame g.raw, like this:
g.raw <- read.table(gfile,sep=',', header=F, row.names=1)
snps = intersect(row.names(na.omit(csnp.raw)),row.names(na.omit(esnp.raw)))
g = g.raw[snps,]
It works. However, that last line is EXTREMELY slow.
g.raw is about 18M rows and snps is about 1M. I realize these are pretty large numbers, but this seems like a simple operation, and reading g into a matrix/data.frame held in memory wasn't a problem (it took a few minutes), whereas the operation described above is taking hours.
How do I speed this up? All I want is to shrink g.raw a lot.

Thanks!
Recommended Answer
This seems to be a case where data.table can shine.
Reproduce the data.frame:
set.seed(1)
N <- 1e6 # total number of rows
M <- 1e5 # number of rows to subset
g.raw <- data.frame(sample(1:N, N), sample(1:N, N), sample(1:N, N))
rownames(g.raw) <- sapply(1:N, function(x) paste(sample(letters, 50, replace=T), collapse=""))
snps <- sample(rownames(g.raw), M)
head(g.raw) # looking into newly created data.frame
head(snps) # and rows for subsetting
The data.frame approach:
system.time(g <- g.raw[snps,])
# > user system elapsed
# > 881.039 0.388 884.821
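For comparison, here is a base-R workaround not covered in the original answer (a sketch with toy stand-ins for g.raw and snps): resolving the row positions once with match() and then subsetting by integer index avoids the slow per-row character row-name lookup.

```r
# Toy stand-ins for the 18M-row g.raw and 1M snps in the question
g.raw <- data.frame(x = 1:5, y = 6:10,
                    row.names = c("a", "b", "c", "d", "e"))
snps  <- c("e", "c", "a")

# match() resolves the row names to integer positions in a single pass;
# subsetting by integer index then skips the character lookup entirely
idx <- match(snps, rownames(g.raw))
g   <- g.raw[idx, ]
```

This keeps the rows in the order of snps, just like g.raw[snps, ] does.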
The data.table approach:
require(data.table)
dt.raw <- as.data.table(g.raw, keep.rownames=T)
# rn is a column with rownames(g.raw)
system.time(setkey(dt.raw, rn))
# > user system elapsed
# > 8.029 0.004 8.046
system.time(dt <- dt.raw[snps,])
# > user system elapsed
# > 0.428 0.000 0.429
Well, roughly 100x faster with these N and M (and an even better speed-up with larger N).
You can compare the results:
head(g)
head(dt)
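To verify that the two approaches really return the same rows, note that the data.table keeps the row names in its rn column, so they must be restored before comparing. A self-contained sketch with toy data (the real g/dt comparison follows the same pattern):

```r
library(data.table)

g.raw <- data.frame(v = 1:4, row.names = c("a", "b", "c", "d"))
snps  <- c("d", "b")

g  <- g.raw[snps, , drop = FALSE]                            # base subset
dt <- as.data.table(g.raw, keep.rownames = TRUE)[snps, on = "rn"]

# Move the rn column back into row names, then compare
g2 <- as.data.frame(dt[, !"rn"])
rownames(g2) <- dt$rn
stopifnot(isTRUE(all.equal(g, g2)))
```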
This concludes the article on R being really slow at matrix/data.frame selection by index; hopefully the recommended answer above helps.