我正在研究Google的细丝工作系统。目前,我正在研究他们实现的WorkStealingDequeue。您可以查看完整的源代码here。该数据结构基于此work。在执行弹出和窃取的过程中,他们将memory_order_seq_cst用作完整的内存屏障。

template <typename TYPE, size_t COUNT>
TYPE WorkStealingDequeue<TYPE, COUNT>::pop() noexcept {
    // mBottom is only written in push(), which cannot be concurrent with pop(),
    // however, it is read in steal(), so we need basic atomicity.
    //   i.e.: bottom = mBottom--;
    int32_t bottom = mBottom.fetch_sub(1, std::memory_order_relaxed) - 1;

    // we need a full memory barrier here; mBottom must be written and visible to
    // other threads before we read mTop.
    int32_t top = mTop.load(std::memory_order_seq_cst);

    if (top < bottom) {
        // Queue isn't empty and it's not the last item, just return it.
        return getItemAt(bottom);
    }

    TYPE item{};
    if (top == bottom) {
        // We took the last item in the queue
        item = getItemAt(bottom);

        // Items can be added only in push() which isn't concurrent to us, however we could
        // be racing with a steal() -- pretend to steal from ourselves to resolve this
        // potential conflict.
        if (mTop.compare_exchange_strong(top, top + 1,
                std::memory_order_seq_cst,
                std::memory_order_relaxed)) {
            // success: mTop was equal to top, mTop now equals top+1
            // We successfully poped an item, adjust top to make the queue canonically empty.
            top++;
        } else {
            // failure: mTop was not equal to top, which means the item was stolen under our feet.
            // top now equals to mTop. Simply discard the item we just poped.
            // The queue is now empty.
            item = TYPE();
        }
    }

    // no concurrent writes to mBottom possible
    mBottom.store(top, std::memory_order_relaxed);
    return item;
}

template <typename TYPE, size_t COUNT>
TYPE WorkStealingDequeue<TYPE, COUNT>::steal() noexcept {
    do {
        // mTop must be read before mBottom
        int32_t top = mTop.load(std::memory_order_seq_cst);

        // mBottom is written concurrently to the read below in pop() or push(), so
        // we need basic atomicity. Also makes sure that writes made in push()
        // (prior to mBottom update) are visible.
        int32_t bottom = mBottom.load(std::memory_order_acquire);

        if (top >= bottom) {
            // queue is empty
            return TYPE();
        }

        // The queue isn't empty
        TYPE item(getItemAt(top));
        if (mTop.compare_exchange_strong(top, top + 1,
                std::memory_order_seq_cst,
                std::memory_order_relaxed)) {
            // success: we stole a job, just return it.
            return item;
        }
        // failure: the item we just tried to steal was pop()'ed under our feet,
        // simply discard it; nothing to do.
    } while (true);
}

为了使实现正确,要求在pop()中的mTop之前先获取mBottom,在steal()中的mBottom之前先获取mTop。如果像大多数实现一样,将memory_order_seq_cst视为完整的内存屏障,那么上面的代码是正确的。但是据我了解,C++ 11并没有说将memory_order_seq_cst作为完整的内存屏障。据我了解,要确保正确的排序,那么mBottom fetch_sub操作必须至少为std::memory_order_acq_rel。我的分析正确吗?

然后是否有必要在mTop上使用memory_order_seq_cst? memory_order_seq_cst强制mTop上的所有操作都按单个总订单(STO)执行。但是在这种情况下,唯一参与STO的人是mTop。我相信我们已经有了修改顺序保证,其中指出每个线程必须就每个变量相对于其自身的修改顺序达成一致。 compare_exchange_strong操作中的memory_order_acq_rel是否足够?

最佳答案

此代码在steal中具有数据争用,因此不管内存顺序如何,其行为均未定义。

没有什么可以阻止窃取线程调用getItemAt(top)读取给定索引的值,而拥有该队列的工作线程调用push的次数足以缠绕缓冲区并覆盖该条目,或者调用pop的次数足以清空队列并然后调用push覆盖该条目。

例如mTop为0,mBottom为1 =>队列中只有一个元素。

窃取线程读取mTopmBottomtop<bottom,因此它会调用getItemAt(top)并由于任务切换而被操作系统挂起。

工作线程调用pop。它读取mBottom并将bottom设置为0。然后读取top(0)。 0==0,因此我们调用getItemAt(bottom)来检索项目。然后将mTop递增为1,并将mBottom设置为1。

然后,工作线程调用push并调用setItemAt(mBottom)设置下一个元素,现在是元素1。

现在,工作线程会重复此push/pop舞蹈COUNT次,因此队列中永远不会有多个元素,但是每次递增mTopmBottom时, Activity 元素将绕缓冲区移动,直到mBottom & MASK再次为0。

工作线程调用push,然后调用setItemAt(mBottom),后者访问元素0。操作系统恢复窃取线程,该线程也正在访问元素0 =>读取和写入相同位置,而无需订购=>数据争用和未定义的行为。

仅当TYPEstd::atomic<T>T时,这才可以。

假设COUNT足够大以至于在实践中永远不会发生,那么pushmBottom写入memory_order_release,而stealmemory_order_acquire读取。这意味着将在steal中读取相关数据项之前进行写操作,因此可以读取该项目。由于称为“释放序列”的概念,即使使用fetch_subpop中的memory_order_relaxed也可以看到。

在负载上使用memory_order_seq_cst并成功进行mTop的比较交换会强制将mTop上的操作转换为单个全局总顺序。但是,关于mToppop加载的注释是错误的:memory_order_seq_cst的使用不会阻止mBottom.fetch_sub调用重新排序,因为这是load中的mTop,并且fetch_sub调用使用memory_order_relaxedmemory_order_seq_cst上的load不会对从同一线程到其他变量的非memory_order_seq_cst写入施加任何顺序。

我目前不确定这可能会对代码产生什么影响。

07-26 09:26