Even though I have followed the advice in these posts, multithreading slows my code down:

Multi-threaded GEMM slower than single threaded one?

Why is this OpenMP program slower than single-thread?

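I believe all the usual precautions have been taken:

  • My CPU has 4 cores plus hyper-threading (effectively 8 logical cores), and I run no more than 4 threads.
  • The number of vector entries processed by each thread seems large enough (2 million per thread). Any false sharing (cache-line contention) should therefore be negligible, since most of a thread's data does not overlap with other threads' data.
  • The entries are contiguous in memory, so cache misses should be unlikely.
  • A tmp variable is used for the successive operations, rather than assigning values directly to the array.
  • Built in Release mode, Visual Studio.
  • There are no critical sections between threads (they use no mutexes and share no data).

  • When measuring the time, I include thread creation. Surely launching 4 threads can't be that expensive?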

    1 thread: about 140 ms

    4 threads: about 155 ms

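    Main:
    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <thread>
    #include <vector>
    
    struct MyStruct {
       double val = 0;
    };
    
    
    size_t numEntries = 100e4;  // 1,000,000 entries
    size_t numThreads = 4;
    std::vector<MyStruct> arr;
    
    void launchThreads(float &avgTime);  // defined in the next listing
    
    int main(){
        arr.reserve(numEntries);
        for(size_t i=0; i<numEntries; ++i){
            // explicit cast: brace initialization rejects narrowing from size_t to double
            MyStruct m{ static_cast<double>(i) };
            arr.push_back(m);
        }
    
        //run several times
        float avgTime=0;
        for(size_t n=0; n<100; ++n){
            launchThreads(avgTime);
            //space out to make avgTime more even:
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
    
        }
    
        avgTime /= 100;
    
        std::cout << "finished in " << avgTime << " milliseconds\n";
        system("pause");  // Windows-only: keeps the console window open
    }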
    

    Launching and running the threads:
    //run by each thread
    void threadWork(size_t threadId){
        size_t numPerThread = (numEntries+numThreads -1) / numThreads;
    
        size_t start_ix = threadId * numPerThread;
    
        size_t endIx;
        if (threadId == numThreads - 1) {
            endIx = numEntries-1;//we are the last thread
        }
        else {
            endIx = start_ix + numPerThread;
        }
    
        for(size_t i=5; i<endIx-5; ++i){
            double tmp = arr[i].val;
    
            tmp += arr[i-1].val;
            tmp += arr[i-3].val;
            tmp += arr[i-4].val;
            tmp += arr[i-5].val;
            tmp += arr[i-2].val;
    
            tmp += arr[i+1].val;
            tmp += arr[i+3].val;
            tmp += arr[i+4].val;
            tmp += arr[i+5].val;
            tmp += arr[i+2].val;
    
            if(tmp > 0){ tmp *= 0.5f;}
            else{ tmp *= 0.3f; }
    
            arr[i].val = tmp;
        }
    }//end()
    
    
    //measures time
    void launchThreads(float &avgTime){
    
        using namespace std::chrono;
        typedef std::chrono::milliseconds ms;
    
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
    
        std::vector<std::thread> threads;
        for (size_t i = 0; i < numThreads; ++i) {
            std::thread t = std::thread(threadWork, i);
            threads.push_back(std::move(t));
        }
    
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i].join();
        }
        high_resolution_clock::time_point t2 = high_resolution_clock::now();
        ms timespan = duration_cast<ms>(t2 - t1);
        avgTime += timespan.count();
    }
    

    Best answer

    Here is your problem:

    for(size_t i=5; i<endIx-5; ++i){
               ^^^
    

    It should be:
    for(size_t i=start_ix + 5; i<endIx-5; ++i){
               ^^^^^^^^^^^^^^
    

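    Regarding "c++ - Multithreading slows down the program: no false sharing, no mutex, no cache misses, not a small workload", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52462481/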
