





For example, if I want to write a multicore version of wc I could do:

cat XXX | parallel --block 10M --pipe wc -l | awk 'BEGIN{count=0;}{count = count+ $1;} END{print count;}'


My question is how to sort using parallel. I know I should pipe the output of parallel to a "merge sorted files" command (just like the final merge in merge sort), but I don't know how to do that.



There are a few ways to do this.


Let's get a simple text file to play with:

$ curl http://www.gutenberg.org/cache/epub/2701/pg2701.txt 2>/dev/null |
   tr " " "\n" | tr "[A-Z]" "[a-z]" |
   sed -e 's/[[:punct:]]*//g' -e '/^[[:space:]]*$/d' > moby-dick-words.txt

$ wc moby-dick-words.txt

215117 moby-dick-words.txt
$ time sort moby-dick-words.txt > moby-dick-words-sorted.txt

real    0m0.260s
user    0m0.462s
sys 0m0.004s

We can do the sorting on chunks of the text, say 10000 words at a time, and defer some of the hard, serial work to the merging (sort -m) part:

$ mkdir tmp
$ time (
  cd tmp;
  split -l 10000 ../moby-dick-words.txt;
  parallel sort {} -o {}.sorted ::: x*;
  sort -m *.sorted > ../moby-dick-words-sorted-merge.txt;
  rm x* )

real    0m0.787s
user    0m0.495s
sys 0m0.103s

$ diff moby-dick-words-sorted.txt moby-dick-words-sorted-merge.txt

$ uniq -c moby-dick-words-sorted-merge.txt | tail
  1 zeuglodon
  1 zigzag
  5 zodiac
  1 zogranda
  4 zone
  1 zone
  2 zoned
  3 zones
  2 zoology
  1 zoroaster

So this splits the text into sequential 10000-line chunks, uses parallel to sort each chunk, and then uses sort -m to merge the sorted chunks into a complete sort.
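
For readers without GNU parallel installed, the same split / sort / merge structure can be sketched with plain shell background jobs. This is only an illustration under stated assumptions: chunk_sort is a hypothetical helper name, it swaps parallel for `&` and `wait`, and it starts one job per chunk with no limit (parallel would cap the number of jobs for you). It also assumes a non-empty input file.

```shell
# chunk_sort FILE LINES_PER_CHUNK  (hypothetical helper, not part of GNU parallel)
# The ( ... ) body runs in a subshell, so its variables and jobs stay local.
chunk_sort() (
  input=$1
  lines_per_chunk=$2
  workdir=$(mktemp -d) || exit 1

  # 1. Split the input into fixed-size chunks: chunk.aa, chunk.ab, ...
  split -l "$lines_per_chunk" "$input" "$workdir/chunk."

  # 2. Sort each chunk in place, one background job per chunk.
  for chunk in "$workdir"/chunk.*; do
    sort -o "$chunk" "$chunk" &
  done
  wait

  # 3. Merge the pre-sorted chunks; sort -m interleaves them without re-sorting.
  sort -m "$workdir"/chunk.*
  rm -r "$workdir"
)

# Tiny demonstration input (made up for this example).
words=$(mktemp)
printf '%s\n' pear apple fig banana cherry date > "$words"
chunk_sort "$words" 2    # prints the six words, one per line, in sorted order
```

The merge step is still serial, so the speedup is limited to the chunk-sorting phase, exactly as in the parallel version above.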


The next approach would be to do the hard work at the split stage, rather than the merge stage, so that the partial results can be merged together by a simple cat:

  $ rm tmp/*
  $ letters="a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9"
  $ time (
    cd tmp;
    parallel sed -e "/^{}/w{}.txt" ../moby-dick-words.txt ::: $letters >& /dev/null;
    parallel sort {}.txt -o {}.sorted.txt ::: $letters;
    cat *.sorted.txt > ../moby-dick-words-sorted-split.txt;
    rm *.txt )

  real  0m1.015s
  user  0m2.355s
  sys   0m0.510s
  $ diff moby-dick-words-sorted-split.txt moby-dick-words-sorted.txt
  $ uniq -c moby-dick-words-sorted-split.txt | tail
  1 zeuglodon
  1 zigzag
  5 zodiac
  1 zogranda
  4 zone
  1 zone
  2 zoned
  3 zones
  2 zoology
  1 zoroaster


Here we split the file (in parallel) by the first character of each line, sort those files individually, and then merge them with a simple concatenation.
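
As a rough sketch of the same idea without GNU parallel, the partition step can be written with grep and background jobs. The name bucket_sort is hypothetical, and the sketch assumes every line starts with a lowercase ASCII letter or a digit (true for the cleaned-up word list above); lines starting with anything else would be silently dropped. It forces LC_ALL=C so that the bucket order and the sort order agree.

```shell
# bucket_sort FILE  (hypothetical helper; assumes lines start with [0-9a-z])
bucket_sort() (
  input=$1
  workdir=$(mktemp -d) || exit 1
  # Buckets listed in byte order, so concatenating them yields sorted output.
  prefixes="0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y z"
  export LC_ALL=C

  # 1. In the background, extract each bucket by first character and sort it.
  for p in $prefixes; do
    ( grep "^$p" "$input" | sort > "$workdir/$p.sorted" ) &
  done
  wait

  # 2. The "merge" is just concatenation in bucket order.
  for p in $prefixes; do
    cat "$workdir/$p.sorted"
  done
  rm -r "$workdir"
)

# Tiny demonstration input (made up for this example).
words=$(mktemp)
printf '%s\n' zone apple 42 zebra banana 7up ant > "$words"
bucket_sort "$words"    # prints: 42 7up ant apple banana zebra zone (one per line)
```

Note that every bucket job reads the whole input, which is why this approach burns more user time than the split-then-merge version.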

Note that this is really for entertainment/educational purposes only; later versions of GNU sort have parallelism built in (look at the --parallel option), which will do a much better job than this. A slicker version of the merge approach can be seen in this answer.
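
A minimal sketch of the built-in route, assuming a GNU coreutils sort recent enough to accept --parallel; the thread count of 4 and the generated input are arbitrary choices for illustration, not recommendations:

```shell
# Sample input (made up): the numbers 20000 down to 1, one per line.
words=$(mktemp)
sorted=$(mktemp)
seq 20000 -1 1 > "$words"

# --parallel=N sets how many sorts GNU sort may run concurrently.
sort --parallel=4 -n -o "$sorted" "$words"

# sort -c exits non-zero if the file is not in order.
sort -n -c "$sorted" && echo "output is sorted"
```

On an input this small the extra threads make no measurable difference; the option only pays off on files large enough for sort to split its work.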

