Question
I have a large data frame (on the order of several GB) that I'd like to convert to a data.table. Using as.data.table creates a copy of the data frame, which means I need available memory of at least twice the size of the data. Is there a way to do the conversion without a copy?
Here's a simple example to demonstrate:
library(data.table)
N <- 1e6
K <- 1e2
data <- as.data.frame(rep(data.frame(rnorm(N)), K))
gc(reset=TRUE)
tracemem(data)
data <- as.data.table(data)
gc()
With output:
library(data.table)
# data.table 1.8.10 For help type: help("data.table")
N <- 1e6
K <- 1e2
data <- as.data.frame(rep(data.frame(rnorm(N)), K))
gc(reset=TRUE)
#             used  (Mb) gc trigger   (Mb)  max used  (Mb)
# Ncells    303759  16.3     597831   32.0    303759  16.3
# Vcells 100442572 766.4  402928632 3074.2 100442572 766.4
tracemem(data)
# [1] "<0x363fda0>"
data <- as.data.table(data)
# tracemem[0x363fda0 -> 0x31e4260]: copy as.data.table.data.frame as.data.table
gc()
#             used  (Mb) gc trigger   (Mb)  max used   (Mb)
# Ncells    304519  16.3     597831   32.0    306162   16.4
# Vcells 100444242 766.4  322342905 2459.3 200933219 1533.0
Recommended answer
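Following this S.O. post, a function setDT is now implemented that takes a list (named and/or unnamed), data.frame (or data.table) as input and returns the same object as a data.table by reference (without any copy). See the ?setDT examples for more info.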
This is in accordance with the data.table naming convention: all set* functions modify by reference. := is the only other one that also modifies by reference.
require(data.table) # v1.9.0+
setDT(data) # converts data which is a data.frame to data.table *by reference*
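One way to check that no copy is made is to compare the memory address of a column before and after the conversion. The following is a minimal sketch, not part of the original answer: the sizes are reduced so it runs quickly, the variable names df, addr_before and addr_after are illustrative, and it assumes data.table's address() helper is available in the installed version.

```r
library(data.table)  # setDT requires data.table v1.9.0 or later

# Smaller than the example in the question so the check runs quickly.
N <- 1e5
K <- 10
df <- as.data.frame(matrix(rnorm(N * K), nrow = N))

# Record where the first column lives in memory before converting.
addr_before <- address(df[[1]])

# Convert in place; note there is no assignment back to df.
setDT(df)

# Look up the same column again after the conversion.
addr_after <- address(df[[1]])

print(is.data.table(df))                   # df now has class data.table
print(identical(addr_before, addr_after))  # unchanged address means the column was not copied
```

If the two addresses match, the column vectors were reused rather than duplicated, which is the behaviour the question asks for.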
See history for older (now outdated) answers.