I learned from "relaxed ordering as a signal" that a store to an atomic variable should be visible to other threads "within a reasonable amount of time".
That said, I am pretty sure it should happen in a very short time (a few nanoseconds?). However, I don't want to rely on "within a reasonable amount of time".
So, here is some code:
std::atomic_bool canBegin{false};

void functionThatWillBeLaunchedInThreadA() {
    if (canBegin.load(std::memory_order_relaxed))
        produceData();
}

void functionThatWillBeLaunchedInThreadB() {
    canBegin.store(true, std::memory_order_relaxed);
}
Threads A and B are within a kind of ThreadPool, so there is no creation of threads or anything of the sort in this problem. I don't need to protect any data, so acquire / consume / release ordering on the atomic store/load is not needed here (I think?).
We know for sure that the functionThatWillBeLaunchedInThreadA function will be launched AFTER the end of functionThatWillBeLaunchedInThreadB.
However, in such code, we don't have any guarantee that the store will be visible in thread A, so thread A can read a stale value (false).
Here are some solutions I have thought about.
Solution 1 : Use volatile
Just declare volatile std::atomic_bool canBegin{false};
Here the volatile qualifier guarantees that we will not see a stale value.
Solution 2 : Use a mutex or spinlock
Here the idea is to protect the canBegin access via a mutex / spinlock, which guarantees via release/acquire ordering that we will not see a stale value. I don't need canGo to be an atomic either.
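A minimal sketch of what I have in mind, assuming std::mutex (the mutex name is mine, and produceData is the same function as above):

#include <mutex>

std::mutex canGoMutex;   // illustrative name; protects canGo
bool canGo = false;      // plain bool: the lock provides all the ordering

void functionThatWillBeLaunchedInThreadA() {
    bool go;
    {
        std::lock_guard<std::mutex> lock(canGoMutex);
        go = canGo;      // read under the lock
    }
    if (go) produceData();   // call outside the critical section
}

void functionThatWillBeLaunchedInThreadB() {
    std::lock_guard<std::mutex> lock(canGoMutex);
    canGo = true;
}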
Solution 3 : not sure at all, but a memory fence?
Maybe this code will not work, so tell me :).
bool canGo{false}; // not an atomic value now

// in thread A
std::atomic_thread_fence(std::memory_order_acquire);
if (canGo) produceData();

// in thread B
canGo = true;
std::atomic_thread_fence(std::memory_order_release);
On cppreference, for this case, it is written that:
All non-atomic and relaxed atomic stores that are sequenced-before FB in thread B will happen-before all non-atomic and relaxed atomic loads from the same locations made in thread A after FA.
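If I read the fuller cppreference text correctly, the fences FB and FA only synchronize through an atomic write/read pair between them, so canGo would have to stay atomic (accessed with relaxed operations), and my fences above sit on the wrong side of the accesses. A corrected sketch of what I think the rule describes (not sure at all):

#include <atomic>

std::atomic<bool> canGo{false};   // atomic again, but every access is relaxed

// in thread B
std::atomic_thread_fence(std::memory_order_release);  // FB: sequenced before the write
canGo.store(true, std::memory_order_relaxed);         // X: the atomic write FB pairs through

// in thread A
if (canGo.load(std::memory_order_relaxed)) {              // Y: reads the value written by X
    std::atomic_thread_fence(std::memory_order_acquire);  // FA: sequenced after the read
    produceData();   // stores sequenced before FB are visible here
}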
Which solution would you use and why?
There's nothing you can do to make a store visible to other threads any sooner. See If I don't use fences, how long could it take a core to see another core's writes? - barriers don't speed up visibility to other cores, they just make this core wait until that's happened.
The store part of an RMW is not different from a pure store for this, either.
(Certainly on x86; not totally sure about other ISAs, where a relaxed LL/SC might possibly get special treatment from the store buffer, possibly being more likely to commit before other stores if this core can get exclusive ownership of the cache line. But I think it still would have to retire from out-of-order execution so the core knows it's not speculative.)
Anthony's answer that was linked in a comment is misleading; as I commented there: If the RMW runs before the other thread's store commits to cache, it doesn't see the value, just like if it was a pure load. Does that mean "stale"? No, it just means that the store hasn't happened yet.
The only reason RMWs need a guarantee about the "latest" value is that they're inherently serializing operations on that memory location. This is what you need if you want 100 unsynchronized fetch_add operations to not step on each other and be equivalent to += 100, but otherwise a best-effort / latest-available value is fine, and that's what you get from a normal atomic load.
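As a concrete sketch of that (mine, using C++11 <thread>; the counter name is illustrative): 100 relaxed fetch_add operations from 100 threads still always sum to 100, because each RMW serializes on the location even with no ordering at all.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> counter{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < 100; ++i)
        threads.emplace_back([&] {
            // Each RMW atomically reads the current value and writes back,
            // so no increment can be lost, even with relaxed ordering.
            counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto &t : threads) t.join();
    std::printf("%d\n", counter.load());  // always prints 100
}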
If you require instant visibility of results (a nanosecond or so), that's only possible within a single thread, like x = y; x += z;
Also note, the C / C++ standard requirement (actually just a note) to make stores visible in a reasonable amount of time is in addition to the requirements on ordering of operations. It doesn't mean seq_cst store visibility can be delayed until after later loads. All seq_cst operations happen in some interleaving of program order across all threads.
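A sketch of what that rules out (my example: the classic store-buffering litmus test). With seq_cst there must be some single interleaving of all four operations, so at least one store precedes both loads and r1 == 0 && r2 == 0 is impossible:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join(); t2.join();
    // Impossible with seq_cst: r1 == 0 && r2 == 0.
    // With relaxed (or even acquire/release), both loads may run before
    // either store becomes visible, so 0,0 is allowed.
    std::printf("r1=%d r2=%d\n", r1, r2);
}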
On real-world C++ implementations, the visibility time is entirely up to hardware inter-core latency. But the C++ standard is abstract, and could in theory be implemented on a CPU that required manual flushing to make stores visible to other threads. Then it would be up to the compiler to not be lazy and defer that for "too long".
volatile atomic<T> is useless; compilers already don't optimize atomic<T>, so every atomic access done by the abstract machine will already happen in the asm. (Why don't compilers merge redundant std::atomic writes?) That's all that volatile does, so volatile atomic<T> compiles to the same asm as atomic<T> for anything you can do with the atomic.
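A small sketch of the non-optimization in question (mine; the exact asm depends on compiler and target):

#include <atomic>

std::atomic<int> a{0};

void three_stores() {
    // The abstract machine performs three atomic stores, and current
    // compilers emit all three (they don't collapse this to a single
    // store of 3), with or without volatile. On x86 each one is an
    // ordinary mov to memory.
    a.store(1, std::memory_order_relaxed);
    a.store(2, std::memory_order_relaxed);
    a.store(3, std::memory_order_relaxed);
}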
Defining "stale" is a problem because separate threads running on separate cores can't see each other's actions instantly. It takes tens of nanoseconds on modern hardware to see a store from another thread.
But you can't read "stale" values from cache; that's impossible because real CPUs have coherent caches. (That's why volatile int could be used to roll your own atomics before C++11, but is no longer useful.) You may need an ordering stronger than relaxed to get the semantics you want as far as one value being older than another (i.e. "reordering", not "stale"). But for a single value, if you don't see a store, that means your load executed before the other core took exclusive ownership of the cache line in order to commit its store, i.e. that the store hasn't truly happened yet.
In the formal ISO C++ rules, there are guarantees about what value you're allowed to see which effectively give you the guarantees you'd expect from cache coherency for a single object, like that after a reader sees a store, further loads in this thread won't see some older store and then eventually back to the newest store. (https://eel.is/c++draft/intro.multithread#intro.races-19).
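A sketch of what that per-object coherence guarantee buys you (my example, assuming a simple polling reader):

#include <atomic>

std::atomic<bool> flag{false};

void reader() {
    // Per-object (read-read) coherence: once a load of flag has returned
    // true, no later load in this thread can return the older value false.
    while (!flag.load(std::memory_order_relaxed)) {
        // spin: each retry may still see the pre-store value for a while
    }
    // Every subsequent load of flag here sees true (or something newer),
    // never a bounce back to the older false.
}

void writer() {
    flag.store(true, std::memory_order_relaxed);
}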
(Note for 2 writers + 2 readers with non-seq_cst operations, it's possible for the readers to disagree about the order in which the stores happened. This is called IRIW reordering, but most hardware can't do it; only some PowerPC. Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - so it's not always quite as simple as "the store hasn't happened yet"; it can be visible to some threads before others. But it's still true that you can't speed up visibility, only for example slow down the readers so none of them see it via the "early" mechanism, i.e. with hwsync for the PowerPC loads to drain the store buffer first.)
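For reference, the IRIW shape looks like this (my sketch); with seq_cst on everything the two readers must agree on the order of the independent stores, while weaker orders would allow them to disagree on hardware like POWER:

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer1() { x.store(1, std::memory_order_seq_cst); }
void writer2() { y.store(1, std::memory_order_seq_cst); }

void reader1() {
    r1 = x.load(std::memory_order_seq_cst);
    r2 = y.load(std::memory_order_seq_cst);
}
void reader2() {
    r3 = y.load(std::memory_order_seq_cst);
    r4 = x.load(std::memory_order_seq_cst);
}
// seq_cst forbids r1==1, r2==0, r3==1, r4==0, i.e. the readers disagreeing
// about which store happened first. With weaker orders the standard allows
// that outcome, though only some hardware (e.g. POWER) actually produces it.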