Question
I have several txt files with the same structure. I want to read them into R using fread and then combine them into one bigger dataset.
## First put all file names into a list
library(data.table)
all.files <- list.files(path = "C:/Users", pattern = "\\.txt$", full.names = TRUE) # escape the dot; return full paths so fread can find the files
## Read data using fread
readdata <- function(fn){
dt_temp <- fread(fn, sep=",")
keycols <- c("ID", "date")
setkeyv(dt_temp,keycols) # Notice there's a "v" after setkey with multiple keys
return(dt_temp)
}
# then using
mylist <- lapply(all.files, readdata)
mydata <- do.call('rbind',mylist)
The code works fine, but the speed is not satisfactory. Each txt file has 1M observations and 12 fields.
If I use fread to read a single file, it is fast. But with the lapply approach above it is extremely slow, and it obviously takes much more time than reading the files one by one. I wonder what went wrong here. Is there any way to improve the speed?
I tried llply from the plyr package, but there was not much speed gain.
Also, is there any syntax in data.table to achieve a vertical join, like rbind in R or UNION in SQL?
Thanks.
Recommended answer
Use rbindlist(), which is designed to rbind a list of data.tables together...
mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )
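On the SQL part of the question, here is a minimal sketch assuming two small made-up tables (a and b are hypothetical, not the asker's data). rbindlist() stacks rows and keeps duplicates, much like UNION ALL. Calling unique() on the result, or funion() for exactly two tables, drops duplicate rows, much like UNION.

```r
library(data.table)

# Hypothetical example tables that share one identical row (ID 2)
a <- data.table(ID = c(1L, 2L), date = c("2014-01-01", "2014-01-02"))
b <- data.table(ID = c(2L, 3L), date = c("2014-01-02", "2014-01-03"))

# Like SQL UNION ALL: stack the rows, duplicates are kept
stacked <- rbindlist(list(a, b))

# Like SQL UNION: drop rows that are duplicated across all columns
deduped <- unique(stacked)

# funion() gives the same result for exactly two tables
deduped2 <- funion(a, b)

print(stacked)
print(deduped)
```

For the files in the question, plain rbindlist() is probably what is wanted, since rows from different files are presumably distinct observations.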
And as @Roland says, do not set the key in each iteration of your function!
So, in summary, this is best:
l <- lapply(all.files, fread, sep=",")
dt <- rbindlist( l )
setkey( dt , ID, date )
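To try the pattern end to end without the original data, here is a small self-contained sketch. The temp directory, the file names part1.txt to part3.txt, the value column and the source_file column are all made up for illustration. It also uses the idcol argument of rbindlist() to record which file each row came from, which the original answer does not do.

```r
library(data.table)

# Hypothetical sample data: three small comma-separated .txt files in a temp directory
tmp_dir <- file.path(tempdir(), "fread_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  piece <- data.table(ID = 1:2 + 10L * i,
                      date = c("2014-01-01", "2014-01-02"),
                      value = i)
  fwrite(piece, file.path(tmp_dir, sprintf("part%d.txt", i)))
}

# Full paths, so fread() works regardless of the working directory
files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)
names(files) <- basename(files)  # list names become the idcol values

# Read every file, bind once, then set the key once on the combined table
l <- lapply(files, fread, sep = ",")
dt <- rbindlist(l, use.names = TRUE, idcol = "source_file")
setkey(dt, ID, date)

print(dt)
```

Setting the key once at the end means the combined table is sorted a single time, rather than each piece being sorted separately before binding.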