Question
I need to compare two directory structures with around one billion files each (directory depth of up to 20 levels).
I found that the usual diff -r /location/one /location/two is slow.
Is there an implementation of a multithreaded diff? Or is it doable by combining shell and diff? If so, how?
Recommended answer
Your disk will be the bottleneck. Unless you are working on tmpfs, you will probably only lose speed. That said:
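# -mindepth 1 skips "." itself, which would otherwise trigger one full serial diff of the whole tree
find . -mindepth 1 -maxdepth 1 -type d -print0 |
xargs -0 -P4 -IDIRNAME diff -EwburqN "DIRNAME/" "/tmp/othertree/DIRNAME/"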
should do a pretty decent job of comparing trees (in this case . to /tmp/othertree).
It has a flaw right now, in that it won't detect top-level directories in /tmp/othertree that don't exist in the current directory (.). I leave that as an exercise for the reader, though you could easily repeat the comparison in reverse.
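One way to cover that gap without a second full pass is to compare only the top-level directory names of the two trees. The following is a minimal sketch, not part of the original answer: the function name only_in_second is hypothetical, and it assumes directory names contain no newlines and that find supports -mindepth and -maxdepth (GNU and BSD find do).

```shell
#!/bin/sh
# Hypothetical helper: print top-level directories that exist in the
# second tree but not in the first.
# Assumption: directory names contain no newline characters.
only_in_second() {
    # $1 = first tree, $2 = second tree
    a=$(mktemp) && b=$(mktemp) || return 1
    # List top-level directory names relative to each tree root, sorted
    # in the C locale so that comm sees the ordering it expects.
    (cd "$1" && find . -mindepth 1 -maxdepth 1 -type d | LC_ALL=C sort) > "$a"
    (cd "$2" && find . -mindepth 1 -maxdepth 1 -type d | LC_ALL=C sort) > "$b"
    # comm -13 suppresses lines unique to the first input and lines
    # common to both, leaving only lines unique to the second input.
    LC_ALL=C comm -13 "$a" "$b"
    rm -f "$a" "$b"
}
```

Calling only_in_second . /tmp/othertree would list the directories the find/xargs pipeline never visits. Swapping the two arguments lists the directories missing from /tmp/othertree.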
The argument -P4 to xargs specifies that you want at most 4 concurrent processes.
Also have a look at the xjobs utility, which does a better job of separating the output. I think with GNU xargs (as shown) you cannot drop the -q option because it will intermix the diffs (?).
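If full diffs are wanted rather than the brief -q report, one possible workaround for the interleaving is to send each worker's output to its own file. This is a sketch under assumptions, not something the answer describes: the function name parallel_diff and the log directory layout are hypothetical, and -r (skip the command on empty input) is a GNU xargs option.

```shell
#!/bin/sh
# Hypothetical helper: diff each top-level directory in parallel and
# write one log file per directory so outputs cannot interleave.
parallel_diff() {
    # $1 = first tree, $2 = second tree, $3 = log directory,
    # $4 = number of concurrent jobs (defaults to 4)
    src=$1 dst=$2 logs=$3 jobs=${4:-4}
    mkdir -p "$logs" || return 1
    (cd "$src" && find . -mindepth 1 -maxdepth 1 -type d -print0) |
        xargs -0 -r -P"$jobs" -n1 sh -c '
            # $1..$3 are the fixed arguments; $4 is the name from find.
            name=${4#./}
            diff -EwburN "$1/$name" "$2/$name" > "$3/$name.diff" 2>&1
            # diff exits 1 when trees differ; return 0 so xargs does not
            # treat a difference as a failure. The log holds the result,
            # including any error messages from diff.
            exit 0
        ' sh "$src" "$dst" "$logs"
}
```

For example, parallel_diff . /tmp/othertree /tmp/difflogs 4 would leave one .diff file per top-level directory in /tmp/difflogs; an empty file means no differences were reported for that directory.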