This article addresses the question: can you do atomic operations on a non-atomic object, and is a plain pointer safer and faster than atomic<T>? The question and recommended answer follow.

Problem Description

I have a dozen threads reading a pointer, and one thread that may change that pointer maybe once an hour or so.

The readers are super, super, super time-sensitive. I hear that atomic<char**> or whatever is the speed of going to main memory, which I want to avoid.

In modern (say, 2012 and later) server and high-end desktop Intel, can an 8-byte-aligned regular pointer be guaranteed not to tear if read and written normally? A test of mine runs an hour without seeing a tear.

Otherwise, would it be any better (or worse) if I do the write atomically and the reads normally? For instance by making a union of the two?

Note that there are other questions about mixing atomic and non-atomic operations that don't specify a CPU, so the discussion devolves into language lawyering. This question is not about the spec, but about what exactly will happen, including whether we know what happens in cases the spec leaves undefined.

Recommended Answer

x86 will never tear an asm load or store to an aligned pointer-width value. That part of this question, and your other question ("C++11 on modern Intel: am I crazy or are non-atomic aligned 64-bit load/store actually atomic?"), are both duplicates of "Why is integer assignment on a naturally aligned variable atomic on x86?"

This is part of why atomic<T> is so cheap for compilers to implement, and why there's no downside to using it.

The only real cost of reading an atomic<T> on x86 is that it can't be optimized into a register across multiple reads of the same variable. But you need to make that happen anyway for your program to work (i.e. to have threads notice updates to the pointer). On non-x86, only mo_relaxed is as cheap as a plain asm load, but x86's strong memory model makes even seq_cst loads cheap.

If you use the pointer multiple times in one function, do T* local_copy = global_ptr; so the compiler can keep local_copy in a register. Think of this as loading from memory into a private register, because that's exactly how it will compile. Operations on atomic objects don't optimize away, so if you want to re-read the global pointer once per loop, write your source that way. Or once outside the loop: write your source that way and let the compiler manage the local var.
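For instance, a minimal sketch of that pattern (the `global_ptr` name and the byte-summing loop are illustrative, not from the original question):

```cpp
#include <atomic>

std::atomic<const char*> global_ptr;   // written rarely by one thread

long sum_bytes(long n) {
    // One atomic load (a plain mov on x86); local_copy then lives in a register.
    const char* local_copy = global_ptr.load();
    long sum = 0;
    for (long i = 0; i < n; ++i)
        sum += local_copy[i];          // no re-read of global_ptr inside the loop
    return sum;
}
```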

Apparently you keep trying to avoid atomic<T*> because you have a huge misconception about performance of std::atomic::load() pure-load operations. std::atomic::store() is somewhat slower unless you use a memory_order of release or relaxed, but on x86 std::atomic has no extra cost for seq_cst loads.

There is no performance advantage to avoiding atomic<T*> here. It will do exactly what you need safely and portably, and with high performance for your read-mostly use case. Each core reading it can access a copy in its private L1d cache. A write invalidates all copies of the line so the writer has exclusive ownership (MESI), but the next read from each core will get a shared copy that can stay hot in its private caches again.
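As a hedged sketch of the whole pattern under discussion (the `Table` type and function names are assumptions, not from the question):

```cpp
#include <atomic>

struct Table { /* big read-mostly data */ };

std::atomic<Table*> g_table{nullptr};

// Hot path: called constantly by the dozen reader threads.
void reader() {
    Table* t = g_table.load();   // a plain mov on x86; the line stays hot in each core's L1d
    // ... read through *t ...
}

// Cold path: called by the single writer about once an hour.
void writer(Table* fresh) {
    g_table.store(fresh);        // invalidates readers' cached copies; their next load re-shares the line
    // (Safely reclaiming the old Table is a separate problem; see the RCU note below.)
}
```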

(This is one of the benefits of coherent caches: readers don't have to keep checking some single shared copy. Writers are forced to make sure there are no stale copies anywhere before they can write. This is all done by hardware, not with software asm instructions. All ISAs that we run multiple C++ threads across have cache-coherent shared memory, which is why volatile sort of works for rolling your own atomics (but don't do it), like people used to have to do before C++11. Or like you're trying to do without even using volatile, which only works in debug builds. Definitely don't do that!)

Atomic loads compile to the same instructions compilers use for everything else, e.g. mov. At the asm level, every aligned load and store is an atomic operation (for power-of-2 sizes up to 8 bytes). atomic<T> only has to stop the compiler from assuming that no other thread is writing the object between accesses.
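You can check this on a compiler explorer; with GCC or Clang targeting x86-64, both loads below compile to the same single mov (a sketch to illustrate the point above, not code from the original answer):

```cpp
#include <atomic>

char** plain_ptr;
std::atomic<char**> atomic_ptr;

char** read_plain()  { return plain_ptr; }          // mov rax, [rip + plain_ptr]
char** read_atomic() { return atomic_ptr.load(); }  // mov rax, [rip + atomic_ptr] -- same instruction
```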

(Unlike pure load / pure store, atomicity of a whole RMW doesn't happen for free; ptr_to_int++ would compile to lock add qword [ptr], 4. But in the uncontended case that's still vastly faster than a cache miss all the way to DRAM, just needing a "cache lock" inside the core that has exclusive ownership of the line. Like 20 cycles per operation if you're doing nothing but that back-to-back on Haswell (https://agner.org/optimize/), but just one atomic RMW in the middle of other code can overlap nicely with surrounding ALU operations.)
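For example, a small illustration of that RMW case (the function name is hypothetical):

```cpp
#include <atomic>

std::atomic<int*> ptr_to_int;

void bump() {
    ptr_to_int++;   // atomic RMW: compiles to  lock add qword ptr [rip + ptr_to_int], 4
}                   // 4 == sizeof(int); the lock prefix makes the whole read-modify-write atomic
```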

Pure read-only access is where lockless code using atomics really shines compared to anything that needs a RWlock - atomic<> readers don't contend with each other so the read-side scales perfectly for a use-case like this (or RCU or a SeqLock).

On x86 a seq_cst load (the default ordering) doesn't need any barrier instructions, thanks to x86's hardware memory-ordering model (program order loads/stores, plus a store buffer with store forwarding). That means you get full performance in the read side that uses your pointer without having to weaken to acquire or consume memory order.

If store performance were a factor, you could use std::memory_order_release so stores can also just be plain mov, without needing to drain the store buffer with mfence or xchg.
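A sketch of such a writer (the `publish` name is an assumption):

```cpp
#include <atomic>

std::atomic<char**> g_ptr;

void publish(char** new_val) {
    // Release store: on x86 this is a plain mov, no mfence, no xchg.
    // The default seq_cst store would instead compile to xchg (or mov + mfence).
    g_ptr.store(new_val, std::memory_order_release);
}
```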

Whatever you read has misled you.

Even getting data between cores doesn't require going to actual DRAM, just to shared last-level cache. Since you're on Intel CPUs, L3 cache is a backstop for cache coherency.

Right after a core writes a cache line, it will still be in its private L1d cache in MESI Modified state (and Invalid in every other cache; this is how MESI maintains cache coherency = no stale copies of lines anywhere). A load on another core from that cache line will therefore miss in the private L1d and L2 caches, but L3 tags will tell the hardware which core has a copy of the line. A message goes over the ring bus to that core, getting it to write-back the line to L3. From there it can be forwarded to the core still waiting for the load data. This is pretty much what inter-core latency measures - the time between a store on one core and getting the value on another core.

The time this takes (inter-core latency) is roughly similar to a load that misses in L3 cache and has to wait for DRAM, like maybe 40ns vs. 70ns depending on the CPU. Perhaps this is what you read. (Many-core Xeons have more hops on the ring bus and more latency between cores, and from cores to DRAM.)

But that's only for the first load after a write. The data is cached by the L2 and L1d caches on the core that loaded it, and in Shared state in L3. After that, any thread that reads the pointer frequently will tend to make the line stay hot in the fast private L2 or even L1d cache on the core running that thread. L1d cache has 4-5 cycle latency, and can handle 2 loads per clock cycle.

And the line will be in Shared state in L3 where any other core can hit, so only the first core pays the full inter-core latency penalty.

(Before Skylake-AVX512, Intel chips use an inclusive L3 cache so the L3 tags can work as a snoop filter for directory-based cache coherence between cores. If a line is in Shared state in some private cache, it's also valid in Shared state in L3. Even on SKX where L3 cache doesn't maintain the inclusive property, the data will be there in L3 for a while after sharing it between cores.)

In debug builds, every variable is stored/reloaded to memory between C++ statements. The fact that this isn't (usually) 400 times slower than normal optimized builds shows that memory access isn't too slow in the un-contended case when it hits in cache. (Keeping data in registers is faster than memory so debug builds are pretty bad in general. If you made every variable atomic<T> with memory_order_relaxed, that would be somewhat similar to compiling without optimization, except for stuff like ++). Just to be clear, I'm not saying that atomic<T> makes your code run at debug-mode speed. A shared variable that might have changed asynchronously needs to be reloaded from memory (through the cache) every time the source mentions it, and atomic<T> does that.

As I said, reading an atomic<char**> ptr will compile to just a mov load on x86, no extra fences, exactly the same as reading a non-atomic object.

Except that it blocks some compile-time reordering, and like volatile stops the compiler from assuming the value never changes and hoisting loads out of loops. It also stops the compiler from inventing extra reads. See https://lwn.net/Articles/793253/
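The classic illustration of that difference (a sketch; the flag names are hypothetical): with a plain non-atomic flag the optimizer may hoist the load, turning a wait loop into an infinite loop, which is exactly the works-only-in-debug-builds behaviour mentioned earlier:

```cpp
#include <atomic>

bool plain_stop = false;               // data race: the load below may be hoisted
std::atomic<bool> atomic_stop{false};  // re-loaded from memory (via cache) each time

void spin_broken() {
    while (!plain_stop) { }  // optimizer may read plain_stop once -> infinite loop
}

void spin_ok() {
    while (!atomic_stop.load(std::memory_order_relaxed)) { }  // re-reads every iteration
}
```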

You might want RCU even if that means copying a relatively large data structure for each of those very infrequent writes. RCU makes readers truly read-only so read-side scaling is perfect.
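A minimal copy-and-publish sketch in that spirit (the names are assumptions; it assumes g_config was already initialized, and real RCU additionally needs a grace period before reclaiming the old copy):

```cpp
#include <atomic>

struct Config { /* relatively large read-mostly structure */ };

std::atomic<const Config*> g_config;

// Readers are truly read-only: no stores to any shared cache line.
const Config* get_config() {
    return g_config.load(std::memory_order_acquire);
}

// The single, infrequent writer: copy, modify, publish.
void update_config() {
    const Config* old = g_config.load(std::memory_order_relaxed);
    Config* fresh = new Config(*old);   // copy the whole structure
    // ... apply the changes to *fresh ...
    g_config.store(fresh, std::memory_order_release);
    // Reclaiming 'old' is the hard part of real RCU: a reader may still be using it.
    // With one write per hour, even leaking the old copy may be acceptable.
}
```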

Other answers to your question "C++11/14/17: a readers/writer lock... without having a lock for the readers?" suggested things involving multiple RWlocks to make sure a reader could always take one. That still involves an atomic RMW on some shared cache line that all readers contend to modify. If you have readers that take an RWlock, they will probably stall for inter-core latency as they get the cache line containing the lock into MESI Modified state.

(Hardware Lock Elision used to solve the problem of avoiding contention between readers, but it's been disabled by microcode updates on all existing hardware.)
