Problem description
I have a 5GB csv with 2 million rows. The header is a row of comma-separated strings, and each data row is comma-separated doubles, with no missing or corrupted data. It is rectangular.
My objective is to read a random 10% of the rows (with or without replacement, it doesn't matter) into RAM as fast as possible. An example of a slow solution (but still faster than read.csv) is to read the whole matrix with fread and then keep a random 10% of the rows.
require(data.table)
X <- data.matrix(fread('/home/user/test.csv'))   # reads the full data matrix
X <- X[sample(nrow(X), round(nrow(X) / 10)), ]   # keep a random 10% of the rows
However, I'm looking for the fastest possible solution (this is slow because I have to read the whole file first and then trim it down afterwards).
The solution deserving of a bounty will give system.time() estimates of the different alternatives.
Other notes:
- I am on Linux.
- I don't need exactly 10% of the rows; approximately 10% is fine (see the sketch after this list).
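Since only an approximate 10% is required, one alternative worth timing (not from the original post) is a per-line Bernoulli sample: pipe the file through awk, keeping each data row with probability 0.1, and let fread parse the result. The path is the one from the question and the 0.1 keep-probability is an assumed sampling rate.

library(data.table)

# Keep the header (NR == 1) and each subsequent row with probability 0.1.
# This yields roughly, not exactly, 10% of the rows.
X <- fread(cmd = "awk 'BEGIN{srand()} NR==1 || rand() < 0.1' /home/user/test.csv")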
Recommended answer
I think this should work pretty quickly, but let me know, since I have not tried it with big data yet.
write.csv(iris, "iris.csv")
fread("shuf -n 5 iris.csv")   # shuf picks 5 random lines; fread runs the string as a shell command
V1 V2 V3 V4 V5 V6
1: 37 5.5 3.5 1.3 0.2 setosa
2: 88 6.3 2.3 4.4 1.3 versicolor
3: 84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1 virginica
5: 114 5.7 2.5 5.0 2.0 virginica
This takes a random sample of N = 5 from the iris dataset.
To avoid the chance of picking up the header row again, a useful modification is:
fread("tail -n +2 iris.csv | shuf -n 5", header = FALSE)
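Applied to the question's file, a sketch might look like the following. The path and the 2-million-row count come from the question, so 200000 is roughly the requested 10%; setnames is used afterwards to restore the column names lost by skipping the header.

library(data.table)

# Time the shell-level sample: skip the header, draw ~10% of the 2M rows, parse.
system.time(
  X <- fread(cmd = "tail -n +2 /home/user/test.csv | shuf -n 200000",
             header = FALSE)
)

# Restore the original column names by reading only the header row.
setnames(X, names(fread("/home/user/test.csv", nrows = 0)))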