Quickly reading and combining several files using data.table (with fread)


Question

I have several different txt files with the same structure. I want to read them into R using fread and then union them into one bigger dataset.

## First collect all file names in a character vector
library(data.table)
# "\\.txt$" matches only names ending in .txt; full.names = TRUE returns
# paths that fread can open from any working directory
all.files <- list.files(path = "C:/Users", pattern = "\\.txt$", full.names = TRUE)

## Read data using fread
readdata <- function(fn){
    dt_temp <- fread(fn, sep=",")
    keycols <- c("ID", "date")
    setkeyv(dt_temp,keycols)  # setkeyv takes the key columns as a character vector
    return(dt_temp)
}
# then using
mylist <- lapply(all.files, readdata)
mydata <- do.call('rbind',mylist)

The code works fine, but the speed is not satisfactory. Each txt file has 1M observations and 12 fields.

If I use fread to read a single file, it is fast. With lapply, however, the whole process is extremely slow and clearly takes much more time than reading the files one by one. What is going wrong here, and is there any way to improve the speed?

I also tried llply from the plyr package, but there was not much of a speed gain.

Also, is there any syntax in data.table for a vertical join, like rbind in R or UNION in SQL?

Thanks.

Recommended answer

Use rbindlist(), which is designed to rbind a list of data.tables together:

mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )
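
On the SQL part of the question: rbindlist() stacks rows and keeps duplicates, which corresponds roughly to UNION ALL. Wrapping the result in unique() gives behaviour closer to UNION. The sketch below uses two tiny made-up tables (not the asker's data) to show the difference.

```r
library(data.table)

# Two small illustrative tables with the same columns (made-up data)
a <- data.table(ID = c(1L, 2L), date = as.IDate(c("2020-01-01", "2020-01-02")))
b <- data.table(ID = c(2L, 3L), date = as.IDate(c("2020-01-02", "2020-01-03")))

# Like UNION ALL: stack the rows, duplicates are kept
union_all <- rbindlist(list(a, b))

# Like UNION: stack the rows, then drop duplicate rows
union_distinct <- unique(rbindlist(list(a, b)))

nrow(union_all)       # 4 rows, the shared row (ID 2) appears twice
nrow(union_distinct)  # 3 rows
```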

And as @Roland says, do not set the key in each iteration of your function!
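
A likely reason, as far as I can tell: setting a key sorts each table, and the table returned by rbindlist() does not carry a key over from its inputs, so the per-file sorting is wasted work. A minimal sketch with made-up data that checks this:

```r
library(data.table)

# Three small keyed tables standing in for the per-file results (made-up data)
parts <- lapply(1:3, function(i) {
  dt <- data.table(ID = c(3L, 1L, 2L), date = i)
  setkeyv(dt, c("ID", "date"))  # per-piece key, as in the question's readdata()
  dt
})

combined <- rbindlist(parts)
key_before <- key(combined)     # NULL: the per-piece keys are not kept

# Set the key once, on the combined table
setkeyv(combined, c("ID", "date"))
key(combined)
```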

In summary, this is best:

l <- lapply(all.files, fread, sep=",")
dt <- rbindlist( l )
setkey( dt , ID, date )
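
For anyone who wants to try the pattern without the original files, here is a self-contained sketch. The temporary directory, the file names, the value column and the source_file column are all assumptions made for the demo; only the ID and date columns come from the question. Naming the list lets rbindlist() record each row's source file via idcol, which is optional.

```r
library(data.table)

# Hypothetical demo files, written to a temp directory so the sketch runs anywhere
demo_dir <- file.path(tempdir(), "fread_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  fwrite(
    data.table(ID = 1:2, date = sprintf("2020-01-0%d", i), value = c(i, i * 10)),
    file.path(demo_dir, sprintf("part%d.txt", i))
  )
}

# Full paths, restricted to names ending in .txt
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
names(files) <- basename(files)  # list names feed the idcol below

# Read every file, stack once, key once
parts <- lapply(files, fread, sep = ",")
dt <- rbindlist(parts, use.names = TRUE, idcol = "source_file")
setkey(dt, ID, date)
```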
