This article walks through splitting a large .csv file into chunks with Python. It should be a useful reference for anyone dealing with the same problem.

Problem description

I have a large .csv file that is well over 300 GB. I would like to chunk it into smaller files of 100,000,000 rows each (each row has approximately 55-60 bytes).

I wrote the following code:

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames,
# each holding up to 100,000,000 rows
df = pd.read_csv('/path/to/really/big.csv', header=None, chunksize=100000000)
count = 1
for chunk in df:
    name = '/output/to/this/directory/file_%s.csv' % count
    chunk.to_csv(name, header=None, index=None)
    print(count)
    count += 1

This code works fine, and I have plenty of space on disk to hold the roughly 5.5-6 GB chunks one at a time, but it's slow.

Is there a better way?

EDIT

I have written the following iterative solution:

import csv

with open('/path/to/really/big.csv', 'r') as csvfile:
    read_rows = csv.reader(csvfile)
    file_count = 1
    row_count = 1
    f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    for row in read_rows:
        # re-join the parsed fields and terminate the line
        f.write(','.join(row) + '\n')
        row_count += 1
        # roll over to a new output file every 100,000,000 rows
        if row_count % 100000000 == 0:
            f.close()
            file_count += 1
            f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    f.close()

EDIT 2

I would like to call attention to Vor's comment about using the Unix/Linux split command, which is the fastest solution I have found.
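
For reference, split breaks a file on line boundaries and writes the pieces with a given prefix. A minimal sketch of driving it from Python with subprocess (the file_ output prefix here is just an illustrative choice):

import subprocess

# split the source file into pieces of 100,000,000 lines each;
# the output files are named file_aa, file_ab, ... in the current directory
subprocess.run(
    ['split', '-l', '100000000', '/path/to/really/big.csv', 'file_'],
    check=True,
)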

Solution

You don't really need to read all that data into a pandas DataFrame just to split the file - you don't even need to read the data into memory at all. You can seek to the approximate offset you want to split at, scan forward until you find a line break, and then copy much smaller chunks from the source file into the destination file between your start and end offsets. (This approach assumes your CSV doesn't have any column values with embedded newlines.)

SMALL_CHUNK = 100000

def write_chunk(source_file, start, end, dest_name):
    # copy the data between offsets start and end from the already-open
    # source file into dest_name, SMALL_CHUNK at a time
    pos = start
    source_file.seek(pos)
    with open(dest_name, 'w') as dest_file:
        for chunk_start in range(start, end, SMALL_CHUNK):
            chunk_end = min(chunk_start + SMALL_CHUNK, end)
            dest_file.write(source_file.read(chunk_end - chunk_start))
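
The answer leaves out the driver that chooses the split points. Below is a minimal, self-contained sketch of the same idea, written for binary mode so that byte offsets from seek() and tell() are reliable in Python 3; the function and constant names, the ~6 GB target size, and the dest_pattern argument are all illustrative assumptions rather than part of the original answer:

import os

TARGET_CHUNK = 6 * 1024 ** 3   # aim for roughly 6 GB per output file
COPY_BUFFER = 1024 * 1024      # copy 1 MB at a time

def split_on_newlines(src_name, dest_pattern):
    # dest_pattern is something like '/output/to/this/directory/file_%s.csv'
    file_size = os.path.getsize(src_name)
    with open(src_name, 'rb') as src:
        part = 1
        start = 0
        while start < file_size:
            # jump to the approximate split point, then scan forward to the
            # next newline so no row is ever cut in half
            end = min(start + TARGET_CHUNK, file_size)
            if end < file_size:
                src.seek(end)
                src.readline()      # consume the rest of the partial line
                end = src.tell()    # actual split point, just past the '\n'
            # copy the bytes between start and end into the next output file
            src.seek(start)
            with open(dest_pattern % part, 'wb') as dest:
                remaining = end - start
                while remaining > 0:
                    buf = src.read(min(COPY_BUFFER, remaining))
                    if not buf:
                        break
                    dest.write(buf)
                    remaining -= len(buf)
            start = end
            part += 1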

Actually, an intermediate solution could be to use the csv module - that would still parse all of the lines in the file, which isn't strictly necessary, but would avoid reading huge arrays into memory for each chunk.
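
A hedged sketch of that intermediate approach, streaming one parsed row at a time and letting csv.writer handle quoting (the ROWS_PER_FILE constant and the next_writer helper are hypothetical names introduced here for illustration):

import csv

ROWS_PER_FILE = 100000000  # rows per output file, as in the question

def next_writer(part):
    # open the next output file and wrap it in a csv writer
    f = open('/output/to/this/directory/file_%s.csv' % part, 'w', newline='')
    return f, csv.writer(f)

with open('/path/to/really/big.csv', 'r', newline='') as src:
    part = 1
    out_file, writer = next_writer(part)
    for row_number, row in enumerate(csv.reader(src), start=1):
        writer.writerow(row)   # rows are written one at a time, never buffered
        if row_number % ROWS_PER_FILE == 0:
            out_file.close()
            part += 1
            out_file, writer = next_writer(part)
    out_file.close()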

That concludes this article on splitting a .csv file into chunks with Python. Hopefully the approaches above are helpful.
