multithreading - std::async performance on Windows and Solaris 10

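I am running a simple threading test program on a Windows machine (compiled with MSVS2015) and on a server running Solaris 10 (compiled with GCC 4.9.3). On Windows I get significant performance gains as I increase the thread count from 1 to the number of available cores; the very same code, however, shows no performance gain at all on Solaris 10.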

The Windows machine has 4 cores (8 logical) and the Unix machine has 8 cores (16 logical).

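What could be causing this? I am compiling with -pthread, and threads are being created, since the program prints all the "S"es before the first "F". I don't have root access on the Solaris machine, and as far as I can tell no tool is installed that would let me view a process's affinity.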

Example code:

#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>

std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count());
std::normal_distribution<double> randn(0.0, 1.0);

double generate_randn(uint64_t iterations)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();

    double rvalue = 0;
    for (int i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();

    return rvalue/iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;

    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);

    double total = 0;

    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;

    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async, generate_randn, count/threads));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        future.wait();
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();

    // Take the average of the threads' results
    total /= threads;

    std::cout << std::endl;
    std::cout << total << std::endl;
    std::cout << "Finished in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms" << std::endl;
}

Best answer

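As a general rule, classes defined by the C++ standard library do not have any internal locking. Modifying an instance of a standard library class from more than one thread, or reading it from one thread while writing it from another, is undefined behavior, unless "objects of that type are explicitly specified as being sharable without data races". (N3337, sections 17.6.4.10 and 17.6.5.9.) The RNG classes are not "explicitly specified as being sharable without data races". (cout is an example of a stdlib object that is sharable without data races, as long as you haven't called ios::sync_with_stdio(false); output from different threads may still interleave.)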

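Your program is therefore incorrect, because it accesses a global RNG object from several threads simultaneously; every time you request another random number, the generator's internal state is modified. On Solaris this seems to result in serialization of the accesses, whereas on Windows it is probably instead causing you not to get properly "random" numbers.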

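The cure is to create a separate RNG for each thread. Each thread will then run independently, and the threads will neither slow each other down nor step on each other's toes. This is a special case of a very general principle: the less data threads share, the better multithreading works.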
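
One way to give every thread its own generator is thread_local storage. The sketch below is an illustration of the principle rather than the approach this answer takes (which passes a seed into each worker); the names thread_engine and mean_of_normals are hypothetical, and std::mt19937 is an arbitrary choice of engine.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: returns the calling thread's private engine.
// thread_local creates one instance per thread, so no engine state is shared.
inline std::mt19937& thread_engine()
{
    // Constructed and seeded once per thread, on first use in that thread.
    // Each thread builds its own temporary random_device; none is shared.
    thread_local std::mt19937 engine{std::random_device{}()};
    return engine;
}

// Mean of `iterations` standard-normal draws, touching only per-thread state.
inline double mean_of_normals(std::uint64_t iterations)
{
    if (iterations == 0)
        return 0.0;  // avoid dividing by zero

    // Distributions also carry internal state, so keep this one local as well.
    std::normal_distribution<double> randn(0.0, 1.0);

    double sum = 0.0;
    for (std::uint64_t i = 0; i < iterations; ++i)
        sum += randn(thread_engine());
    return sum / static_cast<double>(iterations);
}
```

A worker launched with std::async(std::launch::async, mean_of_normals, n) would then use only its own engine. The trade-off is that each thread pays for one random_device call the first time it draws a number.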

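There is one additional wrinkle to worry about: each thread would call system_clock::now at very nearly the same time, so you could end up with several of the per-thread RNGs seeded with the same value. It is better to seed them all from a random_device object. random_device requests random numbers from the operating system and does not need to be seeded, but it can be slow. The random_device should be created and used inside main, with the seeds passed to each worker function, because a global random_device accessed from multiple threads (as in a previous edition of this answer) is just as undefined as a global default_random_engine.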
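
If reproducible runs matter, a variation on this idea is to draw a single base seed and expand it into one seed per thread with std::seed_seq, all from the launching thread. This is a swapped-in technique and an assumption on my part, not what the answer's program does; derive_seeds is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: expand one base seed into `n` per-thread seeds.
// Intended to be called from a single thread (e.g. main) before any task starts.
inline std::vector<std::uint32_t> derive_seeds(std::uint32_t base, std::size_t n)
{
    std::seed_seq seq{base};
    std::vector<std::uint32_t> seeds(n);
    // Fills the range with 32-bit values derived deterministically from `base`.
    seq.generate(seeds.begin(), seeds.end());
    return seeds;
}
```

In main, the base could come from a single std::random_device call, or from a fixed constant when a run has to be repeatable; each worker would then receive seeds[i] as its seed argument.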

Putting it all together, your program should look like this:

#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>
#include <cstdint>   // uint64_t, uint32_t
#include <cstdlib>   // std::atoi

static double generate_randn(uint64_t iterations, unsigned int seed)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();

    std::default_random_engine gen(seed);
    std::normal_distribution<double> randn(0.0, 1.0);

    double rvalue = 0;
    for (uint64_t i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();

    return rvalue/iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;

    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    if (threads == 0)   // 0 or non-numeric input: count/threads would divide by zero
        return 1;

    double total = 0;

    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;

    std::random_device make_seed;

    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async,
                                     generate_randn,
                                     count/threads,
                                     make_seed()));
    }
    for (auto &future : futures)
    {
        // get() blocks until the task has finished, so no separate wait() is needed
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();

    // Take the average of the threads' results
    total /= threads;

    std::cout << '\n' << total
              << "\nFinished in "
              << std::chrono::duration_cast<
                   std::chrono::milliseconds>(t2 - t1).count()
              << " ms\n";
}

Source: "multithreading - std::async performance on Windows and Solaris 10", from a similar question on Stack Overflow: https://stackoverflow.com/questions/39165445/