Problem description
I wrote a shell script which divides the file into 4 parts automatically using csplit, then four shell scripts which execute the same command in the background using nohup, and a while loop which waits for the completion of these four processes and finally runs cat output1.txt ... output4.txt > finaloutput.txt.
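For reference, a rough sketch of that manual approach could look like the following. The file and script names are assumptions, the real per-part command is not shown in the question (wc -l stands in for it), and a plain wait replaces the polling while loop described above:

#!/bin/bash
# Sketch only: split data.txt1 into 4 parts, process each part in the background,
# then combine the per-part outputs.

lines_per_part=$(( $(wc -l < data.txt1) / 4 ))
csplit -s data.txt1 "$lines_per_part" '{2}'   # creates xx00 .. xx03

i=1
for part in xx00 xx01 xx02 xx03; do
    nohup wc -l "$part" > "output$i.txt" 2>/dev/null &   # stand-in for the real command
    i=$((i + 1))
done

wait   # block until all four background jobs have finished
cat output1.txt output2.txt output3.txt output4.txt > finaloutput.txt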
But then I came to know about the parallel command, and I tried it with a big file, but it looks like it is not working as expected. This file is the output of the command below:
for i in $(seq 1 1000000);do cat /etc/passwd >> data.txt1;done
time wc -l data.txt1
10000000 data.txt1
real 0m0.507s
user 0m0.080s
sys 0m0.424s
Parallel:
time cat data.txt1 | parallel --pipe wc -l | awk '{s+=$1} END {print s}'
10000000
real 0m41.984s
user 0m1.122s
sys 0m36.251s
And when I tried this for a 2 GB file (~10 million records), it took more than 20 minutes.
Does this command only work on multi-core systems? (I am currently using a single-core system.)
nproc --all
1
Accepted answer
--pipe is inefficient (though not at the scale you are measuring - something is very wrong on your system). It can deliver in the order of 1 GB/s (total).
--pipepart is, on the contrary, highly efficient. It can deliver in the order of 1 GB/s per core, provided your disk is fast enough. This should be the most efficient way of processing data.txt1. It will split data.txt1 into one block per CPU core and feed those blocks into a wc -l running on each core:
parallel --block -1 --pipepart -a data.txt1 wc -l
You need version 20161222 or later for --block -1 to work.
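To see which version is installed, GNU Parallel can print it; only the command is shown here, since the output depends on your installation:

parallel --version | head -n 1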
These are timings from my old dual-core laptop. seq 200000000 generates 1.8 GB of data.
$ time seq 200000000 | LANG=C wc -c
1888888898
real 0m7.072s
user 0m3.612s
sys 0m2.444s
$ time seq 200000000 | parallel --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 1m28.101s
user 0m25.892s
sys 0m40.672s
The time here is mostly due to GNU Parallel spawning a new wc -c for each 1 MB block. Increasing the block size makes it faster:
$ time seq 200000000 | parallel --block 10m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m26.269s
user 0m8.988s
sys 0m11.920s
$ time seq 200000000 | parallel --block 30m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m21.628s
user 0m7.636s
sys 0m9.516s
As mentioned, --pipepart is much faster if you have the data in a file:
$ seq 200000000 > data.txt1
$ time parallel --block -1 --pipepart -a data.txt1 LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m2.242s
user 0m0.424s
sys 0m2.880s
So on my old laptop I can process 1.8 GB in 2.2 seconds.
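Applied to the original line-count task on data.txt1, the same --pipepart pattern would look like this (a sketch that simply combines the commands shown above; each wc -l prints a count for its block, and the awk step sums them):

parallel --block -1 --pipepart -a data.txt1 wc -l | awk '{s+=$1} END {print s}'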
If you have only one core and your work is CPU-bound, then parallelizing will not help you. Parallelizing on a single-core machine can make sense if most of the time is spent waiting (e.g. waiting for the network).
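For example, fetching a list of URLs is mostly waiting on the network, so running several downloads at once can pay off even on one core. The file urls.txt and the job count here are just an illustration:

parallel -j 8 wget -q {} < urls.txt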
However, the timings from your computer tell me something is very wrong with it. I would recommend you test your program on another computer.