This article looks at why concurrent code using dispatch_group_async can run much slower than the single-threaded version, and how to fix it. It should be a useful reference for anyone hitting the same problem.

Problem Description



I've been doing some experimentation lately on using large numbers of random numbers to generate "normal distribution" bell curves.

The approach is simple:

  • Create an array of integers and zero it out. (I'm using 2001 integers)
  • Repeatedly calculate indexes into this array and increment the entry at the calculated index, as follows
    • Loop either 999 or 1000 times. On each iteration:
      • Seed an array index with the center value (1000)
      • Generate a random step of +1/-1 and add it to the array index
      • At the end of the loop, increment the value at the calculated array index.

Since the +1 and -1 steps tend to occur about equally often, the end index value from the inner loop above tends to stay close to the center value. Index values much larger/smaller than the starting value are increasingly unusual.

After a large number of repetitions, the values in the array take on the shape of a normal distribution bell curve. However, the high-quality random function arc4random_uniform() that I'm using is fairly slow, and it takes a LOT of iterations to generate a smooth curve.

I wanted to plot 1,000,000,000 (one billion) points. Running on the main thread, that takes about 16 hours.

I decided to rewrite the calculation code to use dispatch_async, and run it on my 8-core Mac Pro.

I ended up using dispatch_group_async() to submit 8 blocks, with a dispatch_group_notify() to notify the program when all the blocks have finished processing.

For simplicity on the first pass, all 8 blocks write to the same array of NSUInteger values. There is a small chance of a race condition on a read/modify/write of one of the array entries, but in that case it would simply cause one value to be lost. I was planning on adding a lock to the array increment later (or perhaps even creating separate arrays in each block, and then summing them afterward.)

Anyway, I refactored the code to use dispatch_group_async() and calculate 1/8 of the total values in each block, and set my code off to run. To my utter befuddlement, the concurrent code, while it maxes out all of the cores on my Mac, runs MUCH slower than the single-threaded code.

When run on a single thread, I get about 17,800 points plotted per second. When run using dispatch_group_async, the performance drops to more like 665 points/second, or about 1/26 as fast. I've varied the number of blocks I submit - 2, 4, or 8, it doesn't matter. Performance is awful. I've also tried simply submitting all 8 blocks using dispatch_async with no dispatch_group. That doesn't matter either.

There's currently no blocking/locking going on: All the blocks run at full speed. I am utterly mystified as to why the concurrent code runs slower.

The code is a little muddled now because I refactored it to work either single-threaded or concurrently so I could test.

Here's the code that runs the calculations:

randCount = 2;
#define K_USE_ASYNC 1

#if K_USE_ASYNC
    dispatch_queue_t highQ = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
    //dispatch_group_async

    dispatch_group_t aGroup = dispatch_group_create();
    int totalJobs = 8;
    for (int i = 0; i<totalJobs; i++)
    {
      dispatch_group_async(aGroup,
                           highQ,
                           ^{
                             [self calculateNArrayPoints: KNumberOfEntries /totalJobs
                                          usingRandCount: randCount];
                           });
    }


    dispatch_group_notify(aGroup,
                          dispatch_get_main_queue(),
                          allTasksDoneBlock
                          );
#else
    [self calculateNArrayPoints: KNumberOfEntries
                 usingRandCount: randCount];
    allTasksDoneBlock();
#endif

And the common calculation method, which is used by both the single-threaded and the concurrent version:

+ (void) calculateNArrayPoints: (NSInteger) pointCount usingRandCount: (int) randCount
{
  int entry;
  int random_index;

  for (entry =0; entry<pointCount; entry++)
  {
    static int processed = 0;
    if (entry != 0 && entry%100000 == 0)
    {
      [self addTotTotalProcessed: processed];
      processed = 0;
    }

    //Start with a value of 1000 (center value)
    int value = 0;

    //For each entry, add +/- 1 to the value 1000 times.
    int limit  = KPinCount;
    if (randCount==2)
      if (arc4random_uniform(2) !=0)
        limit--;
    for (random_index = 0; random_index<limit; random_index++)
    {
      int random_value = arc4random_uniform(randCount);
      /*
       if 0, value--
       if 1, no change
       if 2, value++
       */
      if (random_value == 0)
        value--;
      else if (random_value == randCount-1)
        value++;
    }
    value += 1000;
    _bellCurveData[value] += 1;
    //printf("\n\nfinal value = %d\n", value);
    processed++;
  }
}

This is a quick-and-dirty learning project. It runs on both Mac and iOS, so it uses a shared utilities class. The utilities class is nothing but class methods. There is no instance of the utilities method created. It has an embarrassing number of global variables. If I end up doing anything useful with the code, I'll refactor it to create a utilities singleton, and convert all the globals to instance variables on the singleton.

For now, it works, and the hideous use of globals doesn't affect the outcome, so I'm leaving it.

The code that uses the "processed" variable is just a way of figuring out how many points have been calculated when run in concurrent mode. I added that code after I discovered the horrid performance of the concurrent version, so I'm confident it isn't a cause of the slowdown.

I'm stumped here. I've written a fair amount of concurrent code, and this task is an "embarrassingly parallel" problem, so there's no reason it shouldn't run at full tilt on all available cores.

Does anybody else see anything that might cause this, or have any other insights to offer?

Solution

arc4random uses a critical section while modifying its state. The critical section is super-fast in the non-contended case (when going from unlocked to locked), but in the contended case (when trying to lock a mutex that's already locked) it has to call into the operating system and put the thread to sleep, which degrades performance severely.

u_int32_t
arc4random()
{
    u_int32_t rnd;

    THREAD_LOCK();
    arc4_check_init();
    arc4_check_stir();
    rnd = arc4_getword(&rs);
    THREAD_UNLOCK();

    return (rnd);
}

where THREAD_LOCK() is defined as

#define THREAD_LOCK()                       \
    do {                            \
        if (__isthreaded)               \
            _pthread_mutex_lock(&arc4random_mtx);   \
    } while (0)

Source: Arc4 random number generator for OpenBSD

To make it faster:

You could create an Arc4Random class that wraps the static arc4_* functions from arc4random.c. That gives you a random-number generator that is no longer thread-safe, but you can create one generator for each thread.

That concludes this article on concurrent code using dispatch_group_async running much slower than the single-threaded version. We hope the answer recommended above is helpful.
