

I have a 5GB csv with 2 million rows. The header is a line of comma-separated strings and each row is a line of comma-separated doubles, with no missing or corrupted data. The file is rectangular.

My objective is to read a random 10% of the rows (with or without replacement, it doesn't matter) into RAM as fast as possible. An example of a slow solution (though still faster than read.csv) is to read in the whole matrix with fread and then keep a random 10% of the rows.

library(data.table)
X <- data.matrix(fread('/home/user/test.csv')) # reads the full data matrix
X <- X[sample(nrow(X), round(nrow(X) / 10)), ] # keeps a random 10% of the rows


However, I'm looking for the fastest possible solution (this one is slow because it has to read the whole file first and only then trim it).


The solution deserving of a bounty will give system.time() estimates of different alternatives.


  • I am using Linux.

  • I don't need exactly 10% of the rows. Approximately 10% is fine.



I think this should work pretty quickly, but let me know since I have not tried with big data yet.


fread("shuf -n 5 iris.csv")

    V1  V2  V3  V4  V5         V6
1:  37 5.5 3.5 1.3 0.2     setosa
2:  88 6.3 2.3 4.4 1.3 versicolor
3:  84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1  virginica
5: 114 5.7 2.5 5.0 2.0  virginica

This takes a random sample of N = 5 rows from the iris dataset.
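For the file in the question, N would have to be roughly a tenth of the row count rather than 5. A minimal sketch of how that number could be computed in the shell, assuming GNU coreutils and a file that ends with a newline; `tenth_of_rows` is a hypothetical helper name, not part of any tool:

```shell
#!/bin/sh
# Hypothetical helper: print one tenth (rounded down) of the number of
# data rows in a CSV file, where the first line is assumed to be a header.
tenth_of_rows() {
    file="$1"
    # wc -l counts newline characters, so subtract 1 for the header line.
    rows=$(( $(wc -l < "$file") - 1 ))
    echo $(( rows / 10 ))
}
```

The result could then be passed to shuf, for example `shuf -n "$(tenth_of_rows test.csv)" test.csv`. Counting lines costs one extra pass over the file, so it is worth timing against the alternatives.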


Because shuf treats the header like any other line, the header row can end up in the sample. To avoid that, this might be a useful modification:

fread("tail -n +2 iris.csv | shuf -n 5", header = FALSE)
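Since the question says approximately 10% is enough, a different technique from shuf may also be worth timing: keep each data row independently with probability 0.1 in a single pass. This is only a sketch that I have not benchmarked on a 5GB file; `sample_approx_tenth` is a hypothetical name, and it assumes a POSIX awk is installed.

```shell
#!/bin/sh
# Hypothetical helper: print the header line plus each data row with
# probability 0.1, so the output holds roughly (not exactly) 10% of the rows.
sample_approx_tenth() {
    awk 'BEGIN { srand() }   # seed from the clock; same second gives same sample
         NR == 1 || rand() < 0.1' "$1"
}
```

The output keeps the header as its first line, so it could be redirected to a smaller file (for example `sample_approx_tenth test.csv > sample.csv`) and read with fread in the usual way. Unlike the shuf version, the row count does not need to be known in advance.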

