Atomic operations, std::atomic<> and ordering of writes

Question

GCC compiles this:

#include <atomic>
std::atomic<int> a;
int b(0);

void func()
{
  b = 2;
  a = 1;
}

to this:

func():
    mov DWORD PTR b[rip], 2
    mov DWORD PTR a[rip], 1
    mfence
    ret

So, to clarify things for me:


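  • Is any other thread that reads 'a' as 1 guaranteed to read 'b' as 2?

  • Why does the MFENCE come after the write to 'a' rather than before it?

  • Is the write to 'a' guaranteed to be an atomic operation (in the narrow, non-C++ sense) anyway, and does that apply to all Intel processors? I assume so from this output code.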

Also, clang (v3.5.1 -O3) does this:

mov dword ptr [rip + b], 2
mov eax, 1
xchg    dword ptr [rip + a], eax
ret

That appears more straightforward to my little mind, but why the different approach, and what is the advantage of each?

Recommended answer

I put your example on godbolt, and added some functions to read, increment, or combine (a+=b) two atomic variables. I also used a.store(1, memory_order_release); instead of a = 1; to avoid getting more ordering than needed.

See below for a (hopefully correct) explanation.

Read-modify-write operations are where this gets interesting. 1000 evaluations of a+=3 done in multiple threads at once will always produce a += 3000. You'd potentially get fewer if a wasn't atomic.
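
The read-modify-write claim can be sketched as a small test. This is only an illustration: the helper name run_atomic_increments and the split into 4 threads of 250 increments each are assumptions, not something from the godbolt example mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: 4 threads x 250 evaluations of a += 3 (1000 in total).
int run_atomic_increments() {
    std::atomic<int> a{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&a] {
            for (int i = 0; i < 250; ++i) {
                a += 3;  // one indivisible read-modify-write (fetch_add)
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    int total = a.load();
    assert(total == 3000);  // no increment can be lost
    return total;
}
```

With a plain int instead of std::atomic<int> this would be a data race, and the total could come out lower.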

Fun fact: signed atomic types guarantee two's complement wraparound, unlike normal signed types. C and C++ still cling to the idea of leaving signed integer overflow undefined in other cases. Some CPUs don't have arithmetic right shift, so leaving the result of right-shifting negative numbers up to the implementation makes some sense, but otherwise it just feels like a ridiculous hoop to jump through now that all CPUs use 2's complement and 8-bit bytes. </rant>
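
A minimal sketch of the wraparound guarantee, assuming a mainstream two's complement target; the function name wrapped_increment is made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <climits>

// Hypothetical helper: increments an atomic int that already holds INT_MAX.
int wrapped_increment() {
    std::atomic<int> a{INT_MAX};
    a.fetch_add(1);  // well-defined for atomics: wraps around
    // The same step on a plain int (INT_MAX + 1) would be undefined behaviour.
    int result = a.load();
    assert(result == INT_MIN);
    return result;
}
```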

Is any other thread reading 'a' as 1 guaranteed to read 'b' as 2?

Yes, because of the guarantees provided by std::atomic.

Now we're getting into the memory model of the language, and the hardware it runs on.

C11 and C++11 have a very weak memory ordering model, which means the compiler is allowed to reorder memory operations unless you tell it not to. (source: Jeff Preshing's Weak vs. Strong Memory Models). Even if x86 is your target machine, you have to stop the compiler from re-ordering stores at compile time. (e.g. normally you'd want the compiler to hoist a = 1 out of a loop that also writes to b.)

Using C++11 atomic types gives you full sequential-consistency ordering of operations on them with respect to the rest of the program, by default. This means they're a lot more than just atomic. See below for relaxing the ordering to just what's needed, which avoids expensive fence operations.
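
As a rough illustration of that default guarantee, here is the question's b = 2; a = 1; pattern with a reader thread. The wrapper function observe_b_after_flag and the use of local variables instead of globals are assumptions made to keep the sketch self-contained.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: returns the value of b that a reader sees
// after it has observed a == 1.
int observe_b_after_flag() {
    std::atomic<int> a{0};
    int b = 0;      // deliberately non-atomic
    int seen = -1;

    std::thread producer([&] {
        b = 2;
        a = 1;      // seq_cst store by default
    });
    std::thread consumer([&] {
        while (a != 1) {            // seq_cst load by default
            std::this_thread::yield();
        }
        seen = b;   // ordered after the producer's write to b
    });

    producer.join();
    consumer.join();
    assert(seen == 2);
    return seen;
}
```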

Why does the MFENCE happen after the write to 'a', not before?

StoreStore fences are a no-op with x86's strong memory model, so the compiler just has to put the store to b before the store to a to implement the source code ordering.

Full sequential consistency also requires that the store be globally ordered / globally visible before any later loads in program order.

x86 can reorder a store with later loads (StoreLoad reordering). In practice, what happens is that out-of-order execution sees an independent load in the instruction stream, and executes it ahead of a store that was still waiting for its data to be ready. Anyway, sequential consistency forbids this, so gcc uses MFENCE, which is a full barrier, including StoreLoad (the only kind x86 doesn't have for free). LFENCE/SFENCE are only useful for weakly-ordered operations like movnt.

Another way to put this is the way the C++ docs put it: sequential consistency guarantees that all threads see all changes in the same order. The MFENCE after every atomic store guarantees that this thread sees stores from other threads. Otherwise, our loads would see our stores before other threads' loads saw our stores. A StoreLoad barrier (MFENCE) delays our loads until after the stores that need to happen first.
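
The classic store-then-load litmus test gives a feel for what StoreLoad ordering buys. This is a sketch, not the linked sample program: the function name count_both_zero and the iteration count are assumptions. With the default seq_cst operations, both threads reading 0 is forbidden; with weaker orderings it would be allowed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: counts iterations in which both threads read 0.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        // Each thread stores to one variable, then loads the other.
        std::thread t1([&] { x.store(1); r1 = y.load(); });
        std::thread t2([&] { y.store(1); r2 = x.load(); });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) {
            ++both_zero;  // would mean a load passed the earlier store
        }
    }
    assert(both_zero == 0);  // seq_cst rules this outcome out
    return both_zero;
}
```

Swapping the operations to memory_order_relaxed (and removing the assert) turns this into a probe for real StoreLoad reordering, although spotting it reliably needs tighter thread synchronization than this loop provides.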

The ARM32 asm for b=2; a=1; is:

# get pointers and constants into registers
str r1, [r3]     # store b=2
dmb sy           # Data Memory Barrier: full memory barrier to order the stores.
   #  I think just a StoreStore barrier here (dmb st) would be sufficient, but gcc doesn't do that.  Maybe later versions have that optimization, or maybe I'm wrong.
str r2, [r3, #4] # store a=1  (a is 4 bytes after b)
dmb sy           # full memory barrier to order this store wrt. all following loads and stores.

I don't know ARM asm, but what I've figured out so far is that normally it's op dest, src1 [,src2], but loads and stores always have the register operand first and the memory operand second. This is really weird if you're used to x86, where a memory operand can be the source or dest for most non-vector instructions. Loading immediate constants also takes a lot of instructions, because the fixed instruction length only leaves room for 16 bits of payload for movw (move word) / movt (move top).

The release and acquire naming for one-way memory barriers comes from locks:


  • One thread modifies a shared data structure, then releases a lock. The unlock has to be globally visible after all the loads/stores to data it's protecting. (StoreStore + LoadStore)
  • Another thread acquires the lock (read, or RMW with a release-store), and must do all loads/stores to the shared data structure after the acquire becomes globally visible. (LoadLoad + LoadStore)
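
The lock analogy can be sketched as a toy spinlock, where taking the lock is an acquire operation and dropping it is a release operation. The SpinLock class and count_with_spinlock helper are hypothetical names for illustration; production code would normally use std::mutex.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Toy spinlock: acquire on lock, release on unlock.
class SpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Atomic RMW with acquire: later accesses cannot move above it.
        while (locked.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release store: earlier accesses cannot move below it.
        locked.clear(std::memory_order_release);
    }
};

// Hypothetical helper: 4 threads bump a plain int under the lock.
int count_with_spinlock() {
    SpinLock lock;
    int counter = 0;  // non-atomic, protected by the lock
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < 1000; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    assert(counter == 4000);
    return counter;
}
```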

Note that std::atomic uses these names even for standalone fences, which are slightly different from load-acquire or store-release operations. (See atomic_thread_fence, below).

Release/Acquire semantics are stronger than what producer-consumer requires. That just requires one-way StoreStore (producer) and one-way LoadLoad (consumer), without LoadStore ordering.

A shared hash table protected by a readers/writers lock (for example) requires an acquire-load / release-store atomic read-modify-write operation to acquire the lock. x86 lock xadd is a full barrier (including StoreLoad), but ARM64 has load-acquire/store-release version of load-linked/store-conditional for doing atomic read-modify-writes. As I understand it, this avoids the need for a StoreLoad barrier even for locking.

Writes to std::atomic types are ordered with respect to every other memory access in source code (both loads and stores), by default. You can control what ordering is imposed with std::memory_order.

In your case, you only need your producer to make sure stores become globally visible in the correct order, i.e. a StoreStore barrier before the store to a. store(memory_order_release) includes this and more. std::atomic_thread_fence(memory_order_release) is just a 1-way StoreStore barrier for all stores. x86 does StoreStore for free, so all the compiler has to do is put the stores in source order.
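
A sketch of the standalone-fence variant, assuming the same b/a pair as in the question wrapped in a hypothetical helper pass_with_fences. The atomic accesses themselves are relaxed; the fences supply the ordering.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: passes b to a reader using relaxed atomics plus fences.
int pass_with_fences() {
    std::atomic<int> a{0};
    int b = 0;
    int seen = -1;

    std::thread producer([&] {
        b = 2;
        // Orders the store to b before the following store to a.
        std::atomic_thread_fence(std::memory_order_release);
        a.store(1, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (a.load(std::memory_order_relaxed) != 1) {
            std::this_thread::yield();
        }
        // Orders the load of a before the following read of b.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = b;
    });

    producer.join();
    consumer.join();
    assert(seen == 2);
    return seen;
}
```

On x86 neither fence should need a barrier instruction; they mainly stop the compiler from reordering the accesses.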

Release instead of seq_cst will be a big performance win, esp. on architectures like x86 where release is cheap/free. This is even more true if the no-contention case is common.

Reading atomic variables also imposes full sequential consistency of the load with respect to all other loads and stores. On x86, this is free. LoadLoad and LoadStore barriers are no-ops and implicit in every memory op. You can make your code more efficient on weakly-ordered ISAs by using a.load(std::memory_order_acquire).
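
For comparison, a minimal release-store / acquire-load pairing. The names sum_payload_after_acquire, payload and ready are assumptions chosen for the sketch; the payload is a small non-atomic array rather than the single int b.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: publishes a small array with release/acquire.
int sum_payload_after_acquire() {
    int payload[3] = {0, 0, 0};   // non-atomic data
    std::atomic<bool> ready{false};
    int sum = -1;

    std::thread producer([&] {
        payload[0] = 1;
        payload[1] = 2;
        payload[2] = 3;
        ready.store(true, std::memory_order_release);  // publish
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        sum = payload[0] + payload[1] + payload[2];  // sees all three writes
    });

    producer.join();
    consumer.join();
    assert(sum == 6);
    return sum;
}
```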

Note that the std::atomic standalone fence functions confusingly reuse the "acquire" and "release" names for StoreStore and LoadLoad fences that order all stores (or all loads) in at least the desired direction. In practice, they will usually emit HW instructions that are 2-way StoreStore or LoadLoad barriers. This doc is the proposal for what became the current standard. You can see how memory_order_release maps to a #LoadStore | #StoreStore on SPARC RMO, which I assume was included partly because it has all the barrier types separately. (hmm, the cppref web page only mentions ordering stores, not the LoadStore component. It's not the C++ standard, though, so maybe the full standard says more.)

memory_order_consume isn't strong enough for this use-case. This post talks about your case of using a flag to indicate that other data is ready, and talks about memory_order_consume.

consume would be enough if your flag was a pointer to b, or even a pointer to a struct or array. However, no compiler knows how to do the dependency tracking to make sure it puts things in the proper order in the asm, so current implementations always treat consume as acquire. This is too bad, because every architecture except DEC Alpha (and C++11's software model) provides this ordering for free. According to Linus Torvalds, only a few Alpha hardware implementations actually could have this kind of reordering, so the expensive barrier instructions needed all over the place were pure downside for most Alphas.

The producer still needs to use release semantics (a StoreStore barrier), to make sure the new payload is visible when the pointer is updated.
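
The pointer-publishing case might look like the following sketch. The Payload struct and the helper read_through_published_pointer are hypothetical; as noted above, current compilers are expected to treat the consume load as an acquire load, so this does not demonstrate any cheaper code generation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Payload {
    int value;
};

// Hypothetical helper: the flag is a pointer to the payload itself.
int read_through_published_pointer() {
    Payload payload{0};
    std::atomic<Payload*> ptr{nullptr};
    int seen = -1;

    std::thread producer([&] {
        payload.value = 2;
        // Release: the payload is visible before the pointer is.
        ptr.store(&payload, std::memory_order_release);
    });
    std::thread consumer([&] {
        Payload* p = nullptr;
        while ((p = ptr.load(std::memory_order_consume)) == nullptr) {
            std::this_thread::yield();
        }
        seen = p->value;  // data-dependent on the loaded pointer
    });

    producer.join();
    consumer.join();
    assert(seen == 2);
    return seen;
}
```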

It's not a bad idea to write code using consume, if you're sure you understand the implications and don't depend on anything that consume doesn't guarantee. In the future, once compilers are smarter, your code will compile without barrier instructions even on ARM/PPC. The actual data movement still has to happen between caches on different CPUs, but on weak memory model machines, you can avoid waiting for any unrelated writes to be visible (e.g. scratch buffers in the producer).

Just keep in mind that you can't actually test memory_order_consume code experimentally, because current compilers are giving you stronger ordering than the code requests.

It's really hard to test any of this experimentally anyway, because it's timing-sensitive. Also, unless the compiler re-orders operations (because you failed to tell it not to), producer-consumer threads will never have a problem on x86. You'd need to test on an ARM or PowerPC or something to even try to look for ordering problems happening in practice.

References


  • https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458: I reported the gcc bug I found with b=2; a.store(1, MO_release); b=3; producing a=1; b=3 on x86, rather than b=3; a=1;

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461: I also reported the fact that ARM gcc uses two dmb sy in a row for a=1; a=1;, and x86 gcc could maybe do with fewer mfence operations. I'm not sure if an mfence between each store is needed to protect a signal handler from making wrong assumptions, or if it's just a missing optimization.

The Purpose of memory_order_consume in C++11 (already linked above) covers exactly this case of using a flag to pass a non-atomic payload between threads.

What StoreLoad barriers (x86 mfence) are for: a working sample program that demonstrates the need: http://preshing.com/20120515/memory-reordering-caught-in-the-act/

Control dependency barriers:

Doug Lea says x86 only needs LFENCE for data that was written with "streaming" writes like movntdqa or movnti. (NT = non-temporal). Besides bypassing the cache, x86 NT loads/stores have weakly-ordered semantics.

http://preshing.com/20120612/an-introduction-to-lock-free-programming/ (pointers to books and other stuff he recommends).

Interesting thread on realworldtech about whether barriers everywhere or strong memory models are better, including the point that data-dependency is nearly free in HW, so it's dumb to skip it and put a large burden on software. (The thing Alpha (and C++) doesn't have, but everything else does). Go back a few posts from that to see Linus Torvalds' amusing insults, before he got around to explaining more detailed / technical reasons for his arguments.
