Does a hardware memory barrier make the visibility of atomic operations faster, in addition to providing the necessary guarantees?

TL;DR: In a producer-consumer queue, does it ever make sense to put an unnecessary (from the C++ memory model viewpoint) memory fence, or an unnecessarily strong memory order, to get better latency at the expense of possibly worse throughput?

The C++ memory model is implemented on hardware by emitting some sort of memory fence for the stronger memory orders and omitting it for the weaker ones.

In particular, if the producer does store(memory_order_release) and the consumer observes the stored value with load(memory_order_acquire), there are no fences between the load and the store. On x86 there are no fences at all; on ARM, fence operations are placed before the store and after the load.

A value stored without a fence will eventually be observed by a load without a fence (possibly after a few unsuccessful attempts).

I'm wondering whether putting a fence on either side of the queue can make the value be observed sooner. If so, what is the latency with and without a fence?

I expect that a loop with load(memory_order_acquire) and pause / yield, limited to thousands of iterations, is the best option, as it is used everywhere, but I want to understand why.

Since this question is about hardware behavior, I expect there's no generic answer. If so, I'm wondering mostly about x86 (the x64 flavor), and secondarily about ARM.

Example:

T queue[MAX_SIZE];
std::atomic<std::size_t> shared_producer_index;

void producer()
{
    std::size_t private_producer_index = 0;
    for (;;)
    {
        private_producer_index++;  // Handling rollover and queue full omitted

        /* fill data */;

        shared_producer_index.store(
            private_producer_index, std::memory_order_release);
        // Maybe barrier here or stronger order above?
    }
}

void consumer()
{
    std::size_t private_consumer_index = 0;
    for (;;)
    {
        std::size_t observed_producer_index = shared_producer_index.load(
            std::memory_order_acquire);
        while (private_consumer_index == observed_producer_index)
        {
            // Maybe barrier here or stronger order below?
            _mm_pause();
            observed_producer_index = shared_producer_index.load(
                std::memory_order_acquire);
            // Switching from busy wait to kernel wait after some iterations omitted
        }

        /* consume as much data as the index difference specifies */;
        private_consumer_index = observed_producer_index;
    }
}

Solution

Basically no significant effect on inter-core latency, and definitely never worth using "blindly" without careful profiling, if you suspect there might be any contention from later loads missing in cache.

It's a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact, barriers just make this core wait for something that was already going to happen on its own, before doing later loads and/or stores. A full barrier blocks later loads and stores until the store buffer is drained.
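To make that concrete, here is a minimal sketch (my own illustration, not part of the original answer) contrasting a release store with a seq_cst store. The comments describe what compilers typically emit for x86; the exact instructions depend on the compiler.

#include <atomic>

std::atomic<int> flag{0};

void publish_release(int v)
{
    // Typically just a plain mov store on x86. The value sits in this core's
    // store buffer and commits to L1d as fast as the hardware can manage;
    // no extra instruction is needed to "push" it out to other cores.
    flag.store(v, std::memory_order_release);
}

void publish_seq_cst(int v)
{
    // Typically xchg (or mov + mfence) on x86. The extra cost is purely local:
    // this core's later loads and stores wait until the store buffer drains.
    // It does not make the store visible to other cores any sooner.
    flag.store(v, std::memory_order_seq_cst);
}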
Related: Size of store buffers on Intel hardware? What exactly is a store buffer?

In the bad old days before std::atomic, compiler barriers were one way to stop the compiler from keeping values in registers (which are private to a CPU core / thread, not coherent), but that's a compilation issue, not asm. CPUs with non-coherent caches are possible in theory (where std::atomic would need to do explicit flushing to make a store visible), but in practice no implementation runs std::thread across cores with non-coherent caches.

"If I don't use fences, how long could it take a core to see another core's writes?" is highly related; I've written basically this answer at least a few times before. (But this looks like a good place for an answer specifically about this, without getting into the weeds of which barriers do what.)

There might be some very minor secondary effects of blocking later loads that could maybe compete with RFOs (for this core to get exclusive access to a cache line to commit a store). The CPU always tries to drain the store buffer as fast as possible (by committing to L1d cache). As soon as a store commits to L1d cache, it becomes globally visible to all other cores. (Because they're coherent; they'd still have to make a share request...)

Getting the current core to write back some store data to L3 cache (especially in shared state) could reduce the miss penalty if the load on another core happens somewhat after this store commits. But there are no good ways to do that, other than maybe creating a conflict miss in L1d and L2, if producer performance is unimportant apart from creating low latency for the next read.

On x86, Intel Tremont (the low-power Silvermont series) will introduce cldemote (_mm_cldemote), which writes back a line as far as an outer cache, but not all the way to DRAM. (clwb could possibly help, but it forces the store to go all the way to DRAM. Also, the Skylake implementation is just a placeholder and works like clflushopt.)
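As a hedged sketch of how cldemote might be used in the producer from the question (names like Slot and publish are my own; _mm_cldemote needs CLDEMOTE hardware and compiler support, e.g. -mcldemote, and is only a hint whose benefit has to be confirmed by profiling):

#include <atomic>
#include <cstddef>
#include <immintrin.h>   // _mm_cldemote; build with -mcldemote on GCC/Clang

struct alignas(64) Slot { char payload[64]; };   // one cache line of data

Slot queue_slot;                                 // single slot, for illustration only
std::atomic<std::size_t> shared_producer_index{0};

void publish(const Slot& s, std::size_t next_index)
{
    queue_slot = s;                              // fill the data
    shared_producer_index.store(next_index, std::memory_order_release);
#if defined(__CLDEMOTE__)
    // Hint the CPU to push the just-written line toward a shared outer cache,
    // hoping to shrink the consumer's miss penalty. The CPU may ignore it.
    _mm_cldemote(&queue_slot);
#endif
}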
Related questions:

- Is there any way to write for Intel CPU direct core-to-core communication code?
- How to force cpu core to flush store buffer in c?
- x86 MESI invalidate cache line latency issue
- Force a migration of a cache line to another core (not possible)

Fun fact: non-seq_cst stores/loads on PowerPC can store-forward between logical cores on the same physical core, making stores visible to some other cores before they become globally visible to all other cores. This is, AFAIK, the only real hardware mechanism for threads not to agree on a global order of stores to all objects. See "Will two atomic writes to different locations in different threads always be seen in the same order by other threads?". On other ISAs, including ARMv8 and x86, it's guaranteed that stores become visible to all other cores at the same time (via commit to L1d cache).

For loads, CPUs already prioritize demand loads over any other memory accesses (because of course execution has to wait for them). A barrier before a load could only delay it. That might happen to be optimal by coincidence of timing, if it makes the load see the store it was waiting for instead of going "too soon" and seeing the old, boring cached value. But there's generally no reason to assume, or ever predict, that a pause or barrier before a load could be a good idea.

A barrier after a load shouldn't help either. Later loads or stores might be able to start, but out-of-order CPUs generally do stuff in oldest-first priority, so later loads probably can't fill up all the outstanding load buffers before this load gets a chance to get its load request sent off-core (assuming a cache miss because another core stored recently).

I guess I could imagine a benefit to a later barrier if this load address wasn't ready for a while (a pointer-chasing situation) and the maximum number of off-core requests were already in flight when the address did become known.

Any possible benefit is almost certainly not worth it; if there was that much useful work independent of this load that it could fill up all the off-core request buffers (LFBs on Intel), then it might well not be on the critical path, and it's probably a good thing to have those loads in flight.
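Putting the conclusion into practice, here is a hedged sketch of the consumer-side wait loop the question describes: plain acquire loads, _mm_pause while spinning, then a fallback after a bounded number of iterations. The helper name wait_for_new_index and the spin limit are my own placeholders, and std::this_thread::yield() merely stands in for the "kernel wait" the question omits.

#include <atomic>
#include <cstddef>
#include <thread>
#include <immintrin.h>   // _mm_pause (x86)

std::size_t wait_for_new_index(const std::atomic<std::size_t>& shared_producer_index,
                               std::size_t last_seen)
{
    constexpr int kSpinLimit = 4000;   // arbitrary tuning knob, not from the original
    int spins = 0;
    std::size_t observed = shared_producer_index.load(std::memory_order_acquire);
    while (observed == last_seen)
    {
        if (++spins < kSpinLimit)
            _mm_pause();               // be polite to the sibling hyperthread while spinning
        else
            std::this_thread::yield(); // crude stand-in for blocking in the kernel
        observed = shared_producer_index.load(std::memory_order_acquire);
    }
    return observed;   // no extra fence here would make this value arrive any sooner
}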