This article looks at the question of whether x86 atomic RMW instructions are wait-free; the question and answer below may be a useful reference.

Question


On x86, atomic RMW instructions like lock add dword [rdi], 1 are implemented using cache locking on modern CPUs, so the cache line is locked for the duration of the instruction. This is done by getting the line into EXCLUSIVE/MODIFIED state when the value is read; the CPU will not respond to MESI requests from other CPUs until the instruction is finished.
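For illustration (my sketch, not from the question): in C++, a std::atomic fetch_add typically compiles to exactly such a lock-prefixed RMW on x86-64.

```cpp
#include <atomic>

std::atomic<int> counter{0};

// On x86-64, GCC and Clang typically compile this to a single
// "lock add dword ptr [rip + counter], 1" -- one atomic RMW that
// holds the cache line for the duration of the instruction.
void increment() {
    counter.fetch_add(1, std::memory_order_relaxed);
}
```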


There are 2 flavors of concurrent progress conditions: blocking and non-blocking. Atomic RMW instructions are non-blocking. CPU hardware will never sleep or do something else while holding a cache lock (an interrupt happens before or after an atomic RMW, not during), and there is a finite (and small) upper bound on the number of steps before the cache line is released.

Non-blocking algorithms can be split into 3 flavors in theoretical computer science (the first two are contrasted in the sketch after this list):

  1. wait free: all threads will make progress in a finite number of steps.

  2. lock free: at least one thread will make progress in a finite number of steps.

  3. obstruction free: if there is no contention, a thread will make progress in a finite number of steps.
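To make the first two flavors concrete, here is a minimal C++ sketch (my illustration, not part of the question): a single hardware fetch_add is wait-free, while a CAS retry loop is only lock-free, because an individual thread's CAS can keep failing under contention even though each failure implies another thread succeeded.

```cpp
#include <atomic>

std::atomic<int> x{0};

// Wait-free: completes in a bounded number of steps regardless of
// other threads -- a single hardware RMW (lock xadd on x86).
void wait_free_inc() {
    x.fetch_add(1, std::memory_order_relaxed);
}

// Lock-free but not wait-free: this thread's CAS may fail and retry
// indefinitely, but every failure means some other thread's update
// succeeded, so the system as a whole makes progress.
void lock_free_inc() {
    int old = x.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads 'old' with the
    // current value, so the next attempt uses fresh data.
    while (!x.compare_exchange_weak(old, old + 1,
                                    std::memory_order_relaxed)) {
    }
}
```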


What kind of guarantee does x86 provide?


I guess it is at least lock free; if there is contention, at least one CPU will make progress.


But is x86 wait-free for atomic instructions? Is every CPU guaranteed to make progress in a finite number of steps, or could one or more CPUs be starved and potentially delayed indefinitely?


So what happens when there are multiple cores doing atomic operations on the same cache line?

Answer

When multiple threads happen to lock the same cache line, their execution is serialized. This is called write contention due to false sharing (https://en.wikipedia.org/wiki/False_sharing).


The single-writer principle stems from this. Writes cannot be performed concurrently, as opposed to reads.
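As a side note (my sketch, not from the answer), this is also why per-thread counters are usually padded to separate cache lines: two counters that share a line contend even though no data is logically shared.

```cpp
#include <atomic>

// Both counters live in the same cache line: lock-prefixed RMWs from
// two threads serialize on that line even though they touch
// different variables (false sharing).
struct Shared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Padding each counter to its own 64-byte cache line removes the
// write contention; the two RMWs can proceed in parallel.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};
```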


The execution time of atomic read-modify-write instructions themselves is fixed and does not depend on the number of threads contending for the cache line. Therefore, on x86 they are wait-free population-oblivious.


The upper bound of the time it takes to lock a contended cache line is proportional to how much contention the cache line experiences.

From the Intel community:


Since cache line locking will be continuously retried, eventually all atomic read-modify-write operations will succeed (the operation is the instruction plus the retrying done by the hardware to lock the cache line).
So yes, every CPU is guaranteed to make progress in a finite number of steps, and atomic read-modify-write operations as a whole on x86 are wait-free bounded.


By the same logic, the x86 store operation is wait-free bounded, the x86 store instruction is wait-free population-oblivious, and x86 loads are always wait-free population-oblivious.
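For contrast (again my illustration): plain atomic loads and stores on x86 compile to ordinary mov instructions with no lock prefix, which is why they involve no cache line locking at all.

```cpp
#include <atomic>

std::atomic<int> v{0};

// Both compile to a plain mov on x86 (no lock prefix): the store
// commits once the line is in EXCLUSIVE/MODIFIED state, and the
// load can be served from any MESI state that holds valid data.
void store_relaxed(int x) { v.store(x, std::memory_order_relaxed); }
int  load_relaxed()       { return v.load(std::memory_order_relaxed); }
```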


While, as someone suggests, a ucode bug could cause the lock to stay on forever, we do not consider external factors when describing the flavor of an algorithm, but only the logic itself.

Acquiring the cache line lock is not fair.


The probability that a thread is selected to acquire the lock is proportional to how close it is to the thread that released the lock. So, threads on the same core are more likely to acquire the lock than threads that share the L2 cache, which are more likely than threads that share the L3 cache. Then, threads on shorter QPI/UPI/NUMA Node paths have an edge over others, and so on.


This holds true for software locks (spin locks) too, since when a release store is issued, it propagates the same way.
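A minimal test-and-set spin lock in C++ (a sketch under the usual assumptions, not production code) shows where that release store happens:

```cpp
#include <atomic>

// A minimal test-and-set spin lock.
class SpinLock {
    std::atomic<bool> locked_{false};

public:
    void lock() {
        // exchange() is a locked RMW on x86 (xchg is implicitly
        // locked), so acquiring contends on the lock's cache line.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Spin on plain loads to avoid hammering the line with RMWs.
            while (locked_.load(std::memory_order_relaxed)) {
            }
        }
    }

    void unlock() {
        // The release store invalidates the line in other caches;
        // cores closer to this one tend to observe it first, which is
        // the unfairness described above.
        locked_.store(false, std::memory_order_release);
    }
};
```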


I ran a benchmark on an Intel i7 8700 (6c/12t) that confirms all of the above (a sketch of the benchmark loop follows the results below).
When continuously lock xadding over the same memory location...

  • for 10 seconds, out of 5 threads running on different cores, the fastest thread lock xadded 2.5 times more than the slowest one, and out of 10 threads running on different two-way hyper-threads it did 3 times more
  • 300 million times, on average increasingly smaller numbers of lock xadds take increasingly longer, up to 1.1ms for 5 threads running on different cores and up to 193ms for 10 threads running on different two-way hyper-threads


and variance across runs of different processes is high.
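The answer does not include the benchmark source; a minimal C++ reconstruction of the 10-second variant (names and structure are my assumption, and thread pinning is omitted) might look like this:

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> shared_value{0};  // the contended location
std::atomic<bool> stop{false};

// Each thread hammers fetch_add (lock xadd on x86) on the shared
// location and counts how many RMWs it completed.
void worker(long& ops_done) {
    long ops = 0;
    while (!stop.load(std::memory_order_relaxed)) {
        shared_value.fetch_add(1, std::memory_order_relaxed);
        ++ops;
    }
    ops_done = ops;
}

int main() {
    const int kThreads = 5;  // e.g. 5 threads, one per physical core
    std::vector<long> counts(kThreads, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < kThreads; ++i)
        threads.emplace_back([&counts, i] { worker(counts[i]); });

    std::this_thread::sleep_for(std::chrono::seconds(10));
    stop.store(true, std::memory_order_relaxed);
    for (auto& t : threads) t.join();

    // The spread between the fastest and slowest counts reflects the
    // unfair lock acquisition described above.
    for (int i = 0; i < kThreads; ++i)
        std::printf("thread %d: %ld ops\n", i, counts[i]);
}
```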

