Problem Description
I am trying to use Python to find a faster way to sift through a large directory (approx. 1.1 TB) containing around 9 other directories and find files larger than, say, 200 GB, on multiple Linux servers, and it has to be Python.
I have tried many things, like calling du -h from the script, but du is just way too slow to go through a directory as large as 1 TB. I've also tried the find command, as in find ./ -size +200G, but that is also going to take forever.
I have also tried os.walk() with .getsize(), but it's the same problem: too slow. All of these methods take hours and hours, and I need help finding another solution. Not only do I have to run this search for large files on one server, I will also have to ssh into almost 300 servers and output a giant list of all the files > 200 GB, and none of the three methods I have tried will be able to get that done. Any help is appreciated, thank you!
Recommended Answer
It's not true that you cannot do better than os.walk().

scandir is said to be 2 to 20 times faster.

From https://pypi.python.org/pypi/scandir:
In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X. So we're not talking about micro-optimizations.
From Python 3.5, thanks to PEP 471, scandir is now built-in, provided in the os package. Small (untested) example:
import os

max_value = 200 * 1024 ** 3  # 200 GB threshold
for dentry in os.scandir("/path/to/dir"):
    if dentry.stat().st_size > max_value:
        print("{} is biiiig".format(dentry.name))
(Of course you need stat at some point, but with os.walk you called stat implicitly when using the function. Also, if the files have some specific extensions, you could perform stat only when the extension matches, saving even more; a sketch of that follows.)
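For example, a minimal sketch of that extension filter; the .iso and .img extensions and the 200 GB threshold are placeholder assumptions, not from the original question:

import os

max_value = 200 * 1024 ** 3  # 200 GB threshold (placeholder)
wanted_exts = (".iso", ".img")  # hypothetical extensions of interest

for dentry in os.scandir("/path/to/dir"):
    # Run the cheap name test first; stat() only happens when it matches.
    if dentry.name.endswith(wanted_exts) and dentry.stat().st_size > max_value:
        print("{} is biiiig".format(dentry.name))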
There's more:
So migrating to Python 3.5+ magically speeds up os.walk without having to rewrite your code.
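If you do want to rewrite, a recursive scandir loop can cover the whole tree and skip the separate getsize() call per file; this is an untested sketch, and find_big_files and the path are illustrative names of mine, not part of the original answer:

import os

max_value = 200 * 1024 ** 3  # 200 GB threshold (placeholder)

def find_big_files(top):
    # DirEntry type checks usually avoid extra system calls, and the
    # stat() result is cached on the entry after the first call.
    for entry in os.scandir(top):
        if entry.is_dir(follow_symlinks=False):
            yield from find_big_files(entry.path)
        elif entry.is_file(follow_symlinks=False) and entry.stat().st_size > max_value:
            yield entry.path

for path in find_big_files("/path/to/dir"):
    print(path)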
From my experience, multiplying the stat calls on a networked drive is catastrophic performance-wise, so if your target is a network drive, you'll benefit from this enhancement even more than local-disk users.
The best way to get performance on networked drives, though, is to run the scan tool on a machine on which the drive is locally mounted (using ssh, for instance). It's less convenient, but it's worth it.