Problem description
What is the fastest way to iterate over all files in a directory using NTFS and Windows 7, when the file count in the directory is greater than 2,500,000? All files sit flat under the top-level directory.
Currently I use
import os

for root, subFolders, files in os.walk(rootdir):
    for file in files:
        f = os.path.join(root, file)
        with open(f) as cf:
            [...]
but it is very, very slow. The process has been running for about an hour and still has not processed a single file, yet its memory usage keeps growing by about 2 kB per second.
Recommended answer
os.walk has to read the complete listing of a directory before it can yield anything. If you have a deep tree with many leaves, or one directory with millions of entries, I'd guess this could lead to a performance penalty, or at least an increased "startup" time, since walk must read a lot of data before processing the first file.
All of this being speculative, have you tried to force a top-down exploration explicitly? (Note that topdown=True is already the default, so if this changes nothing, traversal order can be ruled out as the culprit.)
for root, subFolders, files in os.walk(rootdir, topdown=True):
    ...
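Since the question states that all files sit flat under the top-level directory, the recursion can be skipped entirely. As a minimal sketch (rootdir is taken from the question; the processing body is a placeholder), the first tuple yielded by os.walk is already the complete top-level listing:

import os

# With a flat layout, the first (root, dirs, files) tuple from
# os.walk already lists every file; no need to descend further.
root, subFolders, files = next(os.walk(rootdir, topdown=True))
for name in files:
    path = os.path.join(root, name)
    # ... process path here ...

Note that this still materializes the full list of 2,500,000 names in memory at once, which is what motivates the iterator-based approach below.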
As the files appear to be in a flat directory, maybe glob.iglob could lead to better performance by returning an iterator (whereas other methods like os.walk, os.listdir or glob.glob first build the list of all files). Could you try something like this:
import glob
import os

# ...
for infile in glob.iglob(os.path.join(rootdir, '*.*')):
    # ...
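Plugged into the loop from the question, that could look like the following sketch (rootdir and the per-file processing are placeholders from the question; note that the pattern '*.*' only matches names containing a dot, so a plain '*' is safer if some files have no extension):

import glob
import os

# iglob yields one matching path at a time instead of building
# a 2,500,000-element list up front.
for infile in glob.iglob(os.path.join(rootdir, '*')):
    with open(infile) as cf:
        pass  # ... per-file processing goes here ...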