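I have a large CSV file of about 15 GB, with 1.8 million columns and 5k rows. I need to transpose the file, or find an efficient way to read it column by column. I'm looking for a time- and memory-efficient solution in Python 2.7, bash, or Matlab.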

CSV structure:

Column names increment from f0, f1 up to f1800000.
Each row has 1.8 million entries, each with a value of either 0 or 1.


---------------------------------------
 f0,f1,f2    .........    ,f1800000
---------------------------------------

 0,0,1       .........    ,0
 1,0,1       .........    ,1

 .........
---------------------------------------

Best answer

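Here is an approach that works, using pandas, processing the columns in small batches (each batch reads all rows but only a few columns):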

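import pandas as pd

# The exact number of columns. It must be an int: 1.8e6 is a float,
# and range() rejects floats.
NCOLS = 1800000

batch_size = 50
from_file = 'my_large_file.csv'
to_file = 'my_large_file_transposed.csv'

# Ceiling division: one extra batch if NCOLS is not a multiple of batch_size.
for batch in range(NCOLS // batch_size + bool(NCOLS % batch_size)):
    lcol = batch * batch_size
    rcol = min(NCOLS, lcol + batch_size)
    # Load only columns lcol..rcol-1 (all rows) into memory.
    data = pd.read_csv(from_file, usecols=range(lcol, rcol))
    # Append the transposed batch: each original column becomes one row,
    # prefixed by its column name. Delete to_file before re-running.
    with open(to_file, 'a') as _f:
        data.T.to_csv(_f, header=False)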
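
If pandas is not available, the same column-batch idea can be sketched with only the standard csv module. This is a minimal sketch rather than the answer's own code: the function name transpose_in_column_batches is made up for illustration, it is written for Python 3 (not the Python 2.7 from the question), and it assumes the first line of the file is the header. Like the pandas version, it re-reads the input once per batch, so a larger batch_size means fewer passes over the file at the cost of more memory.

```python
import csv


def transpose_in_column_batches(from_file, to_file, batch_size=50):
    """Transpose a CSV by re-reading it once per batch of columns.

    Hypothetical helper for illustration; assumes the first line is a header.
    """
    # Read the header once to learn the column names and their count.
    with open(from_file, newline='') as src:
        header = next(csv.reader(src))
    ncols = len(header)

    with open(to_file, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for lcol in range(0, ncols, batch_size):
            rcol = min(ncols, lcol + batch_size)
            # One output row per input column, starting with its name.
            out_rows = [[name] for name in header[lcol:rcol]]
            with open(from_file, newline='') as src:
                reader = csv.reader(src)
                next(reader)  # skip the header line
                for row in reader:
                    # Keep only this batch's slice of the current row.
                    for out_row, value in zip(out_rows, row[lcol:rcol]):
                        out_row.append(value)
            writer.writerows(out_rows)
```

For example, transpose_in_column_batches('my_large_file.csv', 'my_large_file_transposed.csv', batch_size=1000) would write one line per original column, in the same name-then-values layout that data.T.to_csv(_f, header=False) produces above. Only batch_size output rows plus one input row are held in memory at a time.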

08-27 08:45