Problem description
I have a 5GB csv with 2 million rows. The header is a row of comma-separated strings, and each data row is comma-separated doubles, with no missing or corrupted data. It is rectangular.
My objective is to read a random 10% of the rows (with or without replacement, it doesn't matter) into RAM as fast as possible. An example of a slow solution (but still faster than read.csv) is to read the whole matrix with fread and then keep a random 10% of the rows.
require(data.table)
X <- data.matrix(fread('/home/user/test.csv'))   # reads the full data matrix
X <- X[sample(nrow(X), round(nrow(X) / 10)), ]   # keep a random 10% of the rows
However, I'm looking for the fastest possible solution (this is slow because I have to read the whole file first and then trim it down afterwards).
The solution deserving of a bounty will give system.time() estimates of the different alternatives.
Other notes:
- I am on Linux.
- I don't need exactly 10% of the rows; approximately 10% is fine (see the sketch after this list).
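Since only an approximate 10% is required, one alternative worth timing (not from the original post) is a per-line Bernoulli sample: pipe the file through awk, keeping each data row with probability 0.1, and let fread parse the result. The path is the one from the question and the 0.1 keep-probability is an assumed sampling rate.

library(data.table)

# Keep the header (NR == 1) and each subsequent row with probability 0.1.
# This yields roughly, not exactly, 10% of the rows.
X <- fread(cmd = "awk 'BEGIN{srand()} NR==1 || rand() < 0.1' /home/user/test.csv")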
Recommended answer
I think this should work pretty quickly, but let me know, since I have not tried it with big data yet.
write.csv(iris, "iris.csv")
fread("shuf -n 5 iris.csv")   # shuf picks 5 random lines; fread runs the string as a shell command
V1 V2 V3 V4 V5 V6
1: 37 5.5 3.5 1.3 0.2 setosa
2: 88 6.3 2.3 4.4 1.3 versicolor
3: 84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1 virginica
5: 114 5.7 2.5 5.0 2.0 virginica
This takes a random sample of N = 5 from the iris dataset.
To avoid the chance of picking up the header row again, a useful modification is:
fread("tail -n +2 iris.csv | shuf -n 5", header = FALSE)
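Applied to the question's file, a sketch might look like the following. The path and the 2-million-row count come from the question, so 200000 is roughly the requested 10%; setnames is used afterwards to restore the column names lost by skipping the header.

library(data.table)

# Time the shell-level sample: skip the header, draw ~10% of the 2M rows, parse.
system.time(
  X <- fread(cmd = "tail -n +2 /home/user/test.csv | shuf -n 200000",
             header = FALSE)
)

# Restore the original column names by reading only the header row.
setnames(X, names(fread("/home/user/test.csv", nrows = 0)))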