


I have a large file that I need to read in and build a dictionary from. I would like this to be as fast as possible, but my Python code is too slow. Here is a minimal example that shows the problem. First, generate some fake data:


paste <(seq 20000000) <(seq 2 20000001)  > largefile.txt


Now here is a minimal piece of Python code to read it in and make a dictionary.

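import sys
from collections import defaultdict

# Map each key (first column) to the list of values (second column) seen for it.
d = defaultdict(list)  # named d so the built-in dict is not shadowed

with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split()
        d[parts[0]].append(parts[1])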


time ./read.py largefile.txt
real    0m55.746s


However, the whole file can be read much faster, for example with cut:

time cut -f1 largefile.txt > /dev/null
real    0m1.702s


One possibility might be to read in a large chunk of the input, run 8 processes in parallel on different non-overlapping subchunks to build dictionaries from the data in memory, and then read in the next large chunk. Is this possible in Python using multiprocessing somehow?


Update: the fake data above was not very good, as it had only one value per key. A better test file is:

perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt

Several years ago there was a blog series, the "Wide Finder Project", on Tim Bray's site about this kind of problem [1]. There you can find a solution [2] by Fredrik Lundh, of ElementTree [3] and PIL [4] fame. I know posting links is generally discouraged on this site, but I think these links give you a better answer than copy-pasting his code.

[1] http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results
[2] http://effbot.org/zone/wide-finder.htm
[3] http://docs.python.org/3/library/xml.etree.elementtree.html
[4] http://www.pythonware.com/products/pil/
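
To make the chunked idea from the question concrete, here is a minimal sketch using multiprocessing. It is not Fredrik Lundh's code from [2], only an illustration under some assumptions: the file is whitespace-separated with two columns, the whole result fits in memory, and the helper names (chunk_bounds, build_partial, merge_partials, build_dict) are made up for this example. Each worker reads its own byte range of the file, aligned to line boundaries, and builds a partial dictionary; the parent merges the partials in file order.

```python
import os
import sys
from collections import defaultdict
from multiprocessing import Pool

# All function names below are hypothetical helpers for this sketch.


def chunk_bounds(path, n_chunks):
    """Return (start, end) byte ranges that each end on a line boundary."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)
            f.readline()  # move forward to the start of the next full line
            pos = f.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def build_partial(task):
    """Worker: build a dictionary from one byte range of the file."""
    path, start, end = task
    partial = defaultdict(list)
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    for line in data.decode().split("\n"):
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            partial[parts[0]].append(parts[1])
    return dict(partial)


def merge_partials(partials):
    """Combine partial dictionaries, keeping values in the order given."""
    merged = defaultdict(list)
    for partial in partials:
        for key, values in partial.items():
            merged[key].extend(values)
    return merged


def build_dict(path, n_procs=8):
    """Build the full dictionary using n_procs worker processes."""
    tasks = [(path, s, e) for s, e in chunk_bounds(path, n_procs)]
    with Pool(n_procs) as pool:
        # imap yields results in task order, so values stay in file order
        return merge_partials(pool.imap(build_partial, tasks))


if __name__ == "__main__" and len(sys.argv) > 1:
    d = build_dict(sys.argv[1])
    print(len(d), "keys")
```

Whether this is actually faster depends on the data. Every partial dictionary has to be pickled and sent back to the parent process and then merged there, and that cost can eat much of the gain from parallel parsing, so it is worth timing against the single-process version on your real file.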
