Problem description
I have a directory with a large number of files (~1 million). I need to choose a random file from this directory. Since there are so many files, os.listdir naturally takes an eternity to finish.
Is there a way I can circumvent this problem? Maybe somehow find out the number of files in the directory (without listing it) and choose the n-th file, where n is randomly generated?
The files in the directory are randomly named.
Recommended answer
Alas, I don't think there is a solution to your problem. First, I don't know of a portable API that returns the number of entries in a directory without enumerating them first. Second, I don't think there is an API that returns a directory entry by number rather than by name.
So overall, a program has to enumerate O(n) directory entries to get a single random one. The trivial approach of determining the number of entries and then picking one either requires enough RAM to hold the full listing (os.listdir()) or has to enumerate the directory a second time to find the random(n) item, for n + n/2 operations on average overall.
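A minimal sketch of the two-pass variant, for illustration only. It assumes Python 3.5 or later, where os.scandir() yields entries one at a time instead of building a list, and it assumes the directory does not change between the two passes. The function name random_file_two_pass is hypothetical.

```python
import os
import random
from itertools import islice


def random_file_two_pass(directory):
    """Pick a random entry name by enumerating the directory twice."""
    # First pass: count the entries without keeping their names in memory.
    with os.scandir(directory) as entries:
        count = sum(1 for _ in entries)
    if count == 0:
        raise FileNotFoundError("directory is empty: %r" % directory)

    # Choose the index of the entry to return.
    target = random.randrange(count)

    # Second pass: skip ahead to the chosen index (about n/2 steps on average).
    with os.scandir(directory) as entries:
        entry = next(islice(entries, target, None), None)
    if entry is None:
        # Assumption violated: entries were removed between the two passes.
        raise RuntimeError("directory changed while it was being scanned")
    return entry.name
```

This is still O(n) work per pick; it only avoids holding all of the names in memory at once.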
There is a slightly better approach, but only slightly: see randomly-selecting-lines-from-files. In short, there is a way to pick a random item from a list or iterator of unknown length while reading one item at a time, such that every item is picked with equal probability. But this won't help with os.listdir(), because it returns a list in memory that already contains all 1M+ entries, so you might as well just ask for its len().
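The single-pass selection from an iterator of unknown length can be sketched as follows. This is an illustration, not part of the original answer: it assumes Python 3.5 or later and uses os.scandir(), which yields entries lazily, in place of os.listdir(). The function name random_file_single_pass is hypothetical.

```python
import os
import random


def random_file_single_pass(directory):
    """Pick a random entry name in one pass, keeping only one name in memory."""
    chosen = None
    with os.scandir(directory) as entries:
        for seen, entry in enumerate(entries, start=1):
            # Replace the current pick with probability 1/seen. After n
            # entries, each one has been kept with probability 1/n.
            if random.randrange(seen) == 0:
                chosen = entry.name
    # None means the directory had no entries.
    return chosen
```

Every entry is still visited once, so the cost stays O(n); the gain over the two-pass approach is one enumeration instead of roughly one and a half.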