How to write a multicore sort using GNU Parallel

Question

GNU Parallel is a shell tool for executing jobs in parallel using one or more computers.

For example, if I want to write a multicore version of wc, I could do:

cat XXX | parallel --block 10M --pipe wc -l | awk 'BEGIN{count=0;}{count = count+ $1;} END{print count;}'

My question is how to do sorting using parallel. I know that I should pipe the output of parallel into a "merge sorted files" command (just like the final merge in merge sort), but I don't know how to do that.

Recommended answer

There are a few ways to do this.

Let's get a simple text file to play with:

$ curl http://www.gutenberg.org/cache/epub/2701/pg2701.txt 2>/dev/null |
   tr " " "\n" | tr "[A-Z]" "[a-z]" |
   sed -e 's/[[:punct:]]*//g' -e '/^[[:space:]]*$/d' > moby-dick-words.txt

$ wc -l moby-dick-words.txt
215117 moby-dick-words.txt
$ time sort moby-dick-words.txt > moby-dick-words-sorted.txt

real    0m0.260s
user    0m0.462s
sys 0m0.004s

We can sort the text in chunks, say 10000 words at a time, and defer some of the hard, serial work to the merging (sort -m) step:

$ mkdir tmp
$ time (
  cd tmp;
  split -l 10000 ../moby-dick-words.txt;
  parallel sort {} -o {}.sorted ::: x*;
  sort -m *.sorted > ../moby-dick-words-sorted-merge.txt;
  rm x* )

real    0m0.787s
user    0m0.495s
sys 0m0.103s

$ diff moby-dick-words-sorted.txt moby-dick-words-sorted-merge.txt

$ uniq -c moby-dick-words-sorted-merge.txt | tail
  1 zeuglodon
  1 zigzag
  5 zodiac
  1 zogranda
  4 zone
  1 zone
  2 zoned
  3 zones
  2 zoology
  1 zoroaster

So this splits the text into sequential 10000-line chunks, uses parallel to sort each chunk, and then uses sort -m to merge the sorted chunks into one fully sorted file.
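The split, sort and merge steps can be tried at toy scale. The following is only a sketch under stated assumptions: the word list, the chunk. prefix and the file names are made up for illustration, and plain background jobs stand in for parallel so that the snippet runs without GNU Parallel installed.

```shell
#!/bin/sh
# Toy data (made up for illustration), one word per line.
export LC_ALL=C                      # byte-order collation, reproducible results
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' whale ahab sea ship boat queequeg harpoon ishmael > words.txt

split -l 3 words.txt chunk.          # writes chunk.aa, chunk.ab, chunk.ac
for f in chunk.*; do
  sort -o "$f.sorted" "$f" &         # one background sort per chunk
done
wait                                 # block until every chunk is sorted

# Each input is already sorted, so sort -m only has to interleave them.
sort -m chunk.*.sorted > merged.txt
sort words.txt > reference.txt
cmp -s merged.txt reference.txt && echo "merge matches a plain sort"
```

The merge step is serial, but it is cheap: sort -m never re-sorts its inputs, it only picks the smallest head line among the files at each step.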

The next approach is to do the hard work at the split stage rather than the merge stage, so that the partial results can be merged with a simple cat:

  $ rm tmp/*
  $ letters="a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9"
  $ time (
    cd tmp;
    parallel sed -e "/^{}/w{}.txt" ../moby-dick-words.txt ::: $letters >& /dev/null;
    parallel sort {}.txt -o {}.sorted.txt ::: $letters;
    cat *.sorted.txt > ../moby-dick-words-sorted-split.txt;
    rm *.txt )

  real  0m1.015s
  user  0m2.355s
  sys   0m0.510s
  $ diff moby-dick-words-sorted-split.txt moby-dick-words-sorted.txt
  $ uniq -c moby-dick-words-sorted-split.txt | tail
  1 zeuglodon
  1 zigzag
  5 zodiac
  1 zogranda
  4 zone
  1 zone
  2 zoned
  3 zones
  2 zoology
  1 zoroaster

Here we (in parallel) split the file by the first character of each line, sort those files individually, and then the merge is a simple concatenation.
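The partition-then-concatenate idea can also be sketched on toy data. Note the assumptions: this is not the sed-per-letter command above but a single-pass awk partition swapped in for brevity, background jobs again stand in for parallel, and the word list and the bucket. file names are hypothetical.

```shell
#!/bin/sh
export LC_ALL=C                      # byte-order collation for sort and globs
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' whale ahab sea ship boat queequeg harpoon ishmael 1851 > words.txt

# Partition step: route each line to a bucket named after its first character.
awk '{ print > ("bucket." substr($0, 1, 1)) }' words.txt

for f in bucket.*; do
  sort -o "$f.sorted" "$f" &         # sort every bucket independently
done
wait

# Buckets are disjoint and ordered by first character, so concatenating
# them in glob order yields a fully sorted file without a merge step.
cat bucket.*.sorted > split-sorted.txt
sort words.txt | cmp -s - split-sorted.txt && echo "split matches a plain sort"
```

The trade-off is visible even here: the buckets can be very uneven (many words start with "s", few with "q"), so the parallel sorts may not be well balanced.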

Note that this is really for entertainment/educational purposes only; later versions of GNU sort have parallelism built in (look at the --parallel option), which will do a much better job than this. A slicker version of the merge approach can be seen in this answer.
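For comparison, a minimal sketch of the built-in route, assuming a GNU coreutils sort recent enough to support --parallel; the sample words and the temporary file are made up for illustration.

```shell
#!/bin/sh
export LC_ALL=C
infile=$(mktemp)
printf '%s\n' whale ahab sea ship boat > "$infile"

# nproc (GNU coreutils) reports the available cores; fall back to 1 if absent.
threads=$(nproc 2>/dev/null || echo 1)

# --parallel caps the number of sorts that GNU sort runs concurrently.
sorted=$(sort --parallel="$threads" "$infile")
printf '%s\n' "$sorted"
rm -f "$infile"
```

On an input this small the option makes no measurable difference; it only pays off on files large enough for sort to divide the work.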

08-19 23:34