Problem description
I'm a Python beginner and have written a few basic scripts. My latest challenge is to take a very large CSV file (10 GB+) and split it into a number of smaller files, based on the value of a particular variable in each row.
For example, the file may look like this:
Category,Title,Sales
"Books","Harry Potter",1441556
"Books","Lord of the Rings",14251154
"Series", "Breaking Bad",6246234
"Books","The Alchemist",12562166
"Movie","Inception",1573437
And I would want to split the file into separate files: Books.csv, Series.csv, Movie.csv
In reality there will be hundreds of categories, and they will not be sorted. In this case they are in the first column but in future they may not be.
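For illustration, here is a minimal sketch of how the split column could be selected by name rather than by position, so the code does not depend on it being the first column; the file name and the hard-coded 'Category' default are placeholder assumptions, not part of the original question:

import csv

split_column = 'Category'   # placeholder: in practice this could be supplied by the user

with open('largefile.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)   # the header row gives every column a name
    for row in reader:
        key = row[split_column]        # value that decides which output file the row belongs to
        print(key)                     # stand-in for the actual splitting logic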
I've found a few solutions online, but nothing in Python. There is a really simple AWK command that can do this in one line, but I cannot get access to AWK at work.
I've written the following code which works, but I think it is probably very inefficient. Can anybody suggest how to speed it up?
import csv

#Creates empty set - this will be used to store the values that have already been used
filelist = set()

#Opens the large csv file in "read" mode
with open('//directory/largefile', 'r') as csvfile:

    #Read the first row of the large file and store the whole row as a string (headerstring)
    read_rows = csv.reader(csvfile)
    headerrow = next(read_rows)
    headerstring = ','.join(headerrow)
    for row in read_rows:

        #Store the whole row as a string (rowstring)
        rowstring = ','.join(row)

        #Defines filename as the first entry in the row - This could be made dynamic so that the user inputs a column name to use
        filename = row[0]

        #This basically makes sure it is not looking at the header row.
        if filename != "Category":

            #If the filename is not in the filelist set, add it to the list and create new csv file with header row.
            if filename not in filelist:
                filelist.add(filename)
                with open('//directory/subfiles/' + str(filename) + '.csv', 'a') as f:
                    f.write(headerstring)
                    f.write("\n")
                    f.close()

            #If the filename is in the filelist set, append the current row to the existing csv file.
            else:
                with open('//directory/subfiles/' + str(filename) + '.csv', 'a') as f:
                    f.write(rowstring)
                    f.write("\n")
                    f.close()
Thanks!
Recommended answer
A memory-efficient way, and one that avoids repeatedly re-opening files just to append to them (as long as you're not going to generate a huge number of open file handles), is to use a dict to map each category to a file object. Where that file isn't yet open, create it and write the header, then always write each row to the corresponding file, e.g.:
import csv

with open('somefile.csv') as fin:
    csvin = csv.DictReader(fin)
    # Category -> open file lookup
    outputs = {}
    for row in csvin:
        cat = row['Category']
        # Open a new file and write the header
        if cat not in outputs:
            fout = open('{}.csv'.format(cat), 'w')
            dw = csv.DictWriter(fout, fieldnames=csvin.fieldnames)
            dw.writeheader()
            outputs[cat] = fout, dw
        # Always write the row
        outputs[cat][1].writerow(row)

# Close all the files
for fout, _ in outputs.values():
    fout.close()