Question
I've profiled some legacy code I've inherited with cProfile, and I've already made a number of changes that helped (like using simplejson's C extensions!).
Basically, this script exports data from one system to an ASCII fixed-width file. Each row is a record with many values; each line is 7158 characters and contains a ton of spaces. There are 1.5 million records in total, and each row is generated one at a time, which takes a while (5-10 rows per second).
As each row is generated it's written to disk as simply as possible. The profiling indicates that about 19-20% of the total time is spent in file.write(). For a test case of 1,500 rows that's 20 seconds. I'd like to reduce that number.
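(For reference, that kind of profile comes from the stdlib profiler; a typical invocation, with export_script.py standing in as a placeholder for the real entry point, is:

python -m cProfile -s cumulative export_script.py

which sorts the report by cumulative time so hotspots like file.write() float to the top.)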
Now it seems the next win will be reducing the amount of time spent writing to disk. I can keep a cache of records in memory, but I can't wait until the end and dump it all at once. Here's the current code:
import sys
from datetime import datetime

fd = open(data_file, 'w')
for c, (recordid, values) in enumerate(generatevalues()):
    row = prep_row(recordid, values)
    fd.write(row)
    if c % 117 == 0:
        if limit > 0 and c >= limit:
            break
        # progress indicator, rewritten in place every 117 rows
        sys.stdout.write('\r%s @ %s' % (str(c + 1).rjust(7), datetime.now()))
        sys.stdout.flush()
My first thought would be to keep a cache of records in a list and write them out in batches. Would that be faster? Something like:
rows = []
for c, (recordid, values) in enumerate(generatevalues()):
    rows.append(prep_row(recordid, values))
    if c % 117 == 0:
        # write the whole batch with a single call
        fd.write('\n'.join(rows))
        rows = []
My second thought would be to use another thread, but that makes me want to die inside.
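(For what it's worth, the threaded version doesn't have to be elaborate. A minimal sketch using the stdlib queue and threading modules, reusing fd, generatevalues, and prep_row from above; writer_loop and SENTINEL are names invented for this sketch:

import queue
import threading

SENTINEL = None  # marker telling the writer thread to stop

def writer_loop(fd, q):
    # Pull finished rows off the queue and write them, so disk I/O
    # overlaps with generating the next rows in the main thread.
    while True:
        row = q.get()
        if row is SENTINEL:
            break
        fd.write(row)

q = queue.Queue(maxsize=1000)  # bounded, so memory use stays flat
writer = threading.Thread(target=writer_loop, args=(fd, q))
writer.start()

for recordid, values in generatevalues():
    q.put(prep_row(recordid, values))

q.put(SENTINEL)  # signal completion
writer.join()

The bounded queue makes the producer block if the writer falls behind, instead of buffering unboundedly in memory.)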
Answer
Batching the writes into groups of 500 did indeed speed up the writes significantly. For this test case, writing the rows individually took 21.051 seconds of I/O time, writing in batches of 117 took 5.685 seconds to write the same number of rows, and batches of 500 took a total of only 0.266 seconds.
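(A rough sketch of what that batched version looks like, assuming prep_row already terminates each row with a newline, as the per-row fd.write(row) in the question implies; BATCH_SIZE is the only tunable:

BATCH_SIZE = 500
rows = []
for recordid, values in generatevalues():
    # assumes prep_row() output already ends with '\n', as in the per-row version
    rows.append(prep_row(recordid, values))
    if len(rows) >= BATCH_SIZE:
        fd.write(''.join(rows))  # one write() call per 500 rows
        rows = []
if rows:
    fd.write(''.join(rows))  # don't drop the final partial batch

Presumably the win comes mostly from amortizing the per-call overhead of file.write(), since the file object and the OS were already buffering the actual disk writes.)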