Does an x86_64 CPU use the same cache lines to communicate between 2 processes via shared memory?

Question

As is known, all levels of cache (L1/L2/L3) on modern x86_64 are virtually indexed, physically tagged, and all cores communicate via the last-level cache (L3) using a cache-coherence protocol (MOESI/MESIF) over QPI/HyperTransport.

For example, a Sandy Bridge family CPU has a 4- to 16-way L3 cache and a 4 KB page size; this would allow data to be exchanged between concurrent processes running on different cores via shared memory. This would be possible because the L3 cache cannot, at the same time, hold the same physical memory area once as a page of process-1 and once as a page of process-2.
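As a concrete illustration of the setup the question assumes, here is a minimal sketch using POSIX shared memory (the object name "/demo_shm" is hypothetical, error handling is trimmed, and older glibc needs -lrt): one process creates and maps the object, and a second process mapping the same name would see the same physical page at its own virtual address.

```c
/* Minimal sketch of the setup assumed in the question: one process creates a
 * shared memory object and maps it; a second process mapping the same object
 * gets the same physical page at (possibly different) virtual addresses.     */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;                      /* one 4 KB page */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Writer side: a second process that mmaps "/demo_shm" would read this. */
    strcpy(p, "hello from process-1");
    printf("mapped at %p: %s\n", (void *)p, p);

    munmap(p, len);
    close(fd);
    shm_unlink("/demo_shm");
    return 0;
}
```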

Does this mean that every time process-1 requests the same shared memory region, process-2 flushes its cache lines for that page to RAM, and process-1 then reloads the same memory region as cache lines of a page in process-1's virtual address space? Is that really slow, or does the processor use some optimizations?

Or does a modern x86_64 CPU use the same cache lines, without any flushes, to communicate between 2 processes with different virtual address spaces via shared memory?

Sandy Bridge Intel CPU - cache L3:

  • 8 MB - cache size
  • 64 B - cache line size
  • 128 K - lines (128 K = 8 MB / 64 B)
  • 16-way
  • 8 K - number of sets (8 K = 128 K lines / 16 ways)
  • 13 bits [18:6] - of the virtual address (the index) select the current set number
  • 512 K - addresses that are a multiple of 512 K apart map to the same set (512 K = 8 MB / 16 ways)
  • low 19 bits - significant for determining the current set number

4 KB - standard page size

We have 7 missing bits [18:12] - i.e. we would need to check (2^7 × 16 ways) = 2048 cache lines. That is the same as a 2048-way cache - so it would be very slow. Does this mean that the L3 cache is (physically indexed, physically tagged)?
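The arithmetic behind that estimate can be written out as a small sketch (parameters taken from the list above; it assumes all 7 index bits above the 4 KB page offset are unknown, and it uses the GCC/Clang builtin __builtin_ctz):

```c
/* Sketch of the set-index arithmetic for the Sandy Bridge L3 parameters
 * quoted in the question (8 MB, 64 B lines, 16-way), and the number of
 * candidate lines if the 7 index bits above the 4 KB page offset are unknown. */
#include <stdio.h>

int main(void)
{
    const unsigned cache_size = 8u << 20;               /* 8 MB  */
    const unsigned line_size  = 64;                     /* 64 B  */
    const unsigned ways       = 16;
    const unsigned page_size  = 4096;                   /* 4 KB  */

    unsigned lines = cache_size / line_size;            /* 128 K lines */
    unsigned sets  = lines / ways;                      /* 8 K sets    */

    unsigned offset_bits = __builtin_ctz(line_size);    /* 6  */
    unsigned index_bits  = __builtin_ctz(sets);         /* 13 -> index is [18:6] */
    unsigned page_bits   = __builtin_ctz(page_size);    /* 12 */

    /* Index bits above the page offset depend on the physical page frame,
     * not on the virtual address, hence are "missing" for virtual indexing. */
    unsigned missing         = offset_bits + index_bits - page_bits;   /* 7    */
    unsigned candidate_lines = (1u << missing) * ways;                 /* 2048 */

    printf("lines=%u sets=%u index bits [%u:%u] missing=%u candidates=%u\n",
           lines, sets, offset_bits + index_bits - 1, offset_bits,
           missing, candidate_lines);
    return 0;
}
```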

Summary of the index bits missing from the virtual address (page size 4 KB - 12 bits):

  • L3 (8 MB = 64 B x 128 K lines), 16-way, 8 K sets, 13 index bits [18:6] - 7 bits missing
  • L2 (256 KB = 64 B x 4 K lines), 8-way, 512 sets, 9 index bits [14:6] - 3 bits missing
  • L1 (32 KB = 64 B x 512 lines), 8-way, 64 sets, 6 index bits [11:6] - no missing bits
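The per-level numbers above all follow from one formula - missing index bits = log2(cache size / ways / page size) - as this small sketch shows (level parameters as quoted above, 4 KB pages assumed, GCC/Clang builtin used):

```c
/* Sketch: index bits that extend above the 4 KB page offset, per cache level,
 * using the sizes and associativities quoted in the question.                */
#include <stdio.h>

static unsigned log2u(unsigned x) { return (unsigned)__builtin_ctz(x); }

int main(void)
{
    const unsigned page = 4096;
    struct { const char *name; unsigned size, ways; } lvl[] = {
        { "L1", 32u  << 10, 8  },
        { "L2", 256u << 10, 8  },
        { "L3", 8u   << 20, 16 },
    };

    for (unsigned i = 0; i < 3; i++) {
        unsigned way_size = lvl[i].size / lvl[i].ways;  /* bytes covered by index+offset */
        unsigned missing  = way_size > page ? log2u(way_size / page) : 0;
        printf("%s: way size %6u B, index bits above page offset: %u\n",
               lvl[i].name, way_size, missing);         /* prints 0, 3, 7 */
    }
    return 0;
}
```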

It should be:

  • L3 / L2 (physically indexed, physically tagged), used after the TLB lookup
  • L1 (virtually indexed, physically tagged)

Answer

Huh what? If both processes have a page mapped, they can both hit in the cache for the same line of physical memory.

That's part of the benefit of Intel's multicore designs using large inclusive L3 caches. Coherency only requires checking the L3 tags to find cache lines in E or M state in another core's L2 or L1 cache.

Getting data between two cores only requires write-back to L3. I forget where this is documented; maybe http://agner.org/optimize/. CPUs before Nehalem, which had separate caches for each core, I think had to flush to DRAM for coherency. IDK if the data could be sent directly from cache to cache with the same protocol used to detect coherency issues.
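A minimal sketch of the hand-off the answer describes: one core publishes data and another picks it up through ordinary coherent memory, with a C11 release/acquire flag and no explicit cache-flush instructions. Threads are used here merely as a stand-in for two processes sharing a mapping; the names are illustrative (compile with -pthread).

```c
/* Sketch: producer/consumer hand-off through coherent caches; no explicit
 * flush instructions are needed, the hardware coherence protocol (via L3 on
 * Intel) moves the line between cores.                                       */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;                        /* data in the shared line        */
static atomic_int ready = 0;               /* release/acquire hand-off flag  */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                         /* write the data  */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                 /* spin until seen */
    printf("consumer saw payload = %d\n", payload);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```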

The same cache line mapped to different virtual addresses will always go in the same set of the L1 cache. See the discussion in the comments: the L2/L3 caches are physically indexed as well as physically tagged, so aliasing is never a problem. (Only L1 could get a speed benefit from virtual indexing. L1 cache misses aren't detected until after address translation has finished, so the physical address is ready in time to probe the higher-level caches.)

Also note that the discussion in the comments incorrectly mentions Skylake lowering the associativity of the L1 cache. In fact, it's the Skylake L2 cache that is less associative than before (4-way, down from 8-way in SnB/Haswell/Broadwell). L1 is still 32 KiB 8-way as always: the maximum size for that associativity that keeps the page-selection address bits out of the index. So there's no mystery after all.
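That "maximum size" remark is just arithmetic: with 8 ways and 64 B lines, 32 KiB means each way spans exactly 4 KiB, so every index bit falls inside the page offset and is identical in the virtual and physical address. A small sketch of that check (parameters assumed from the answer):

```c
/* Sketch: why a 32 KiB, 8-way, 64 B-line L1 can be virtually indexed without
 * aliasing: all index+offset bits fit inside the 4 KiB page offset.          */
#include <assert.h>
#include <stdio.h>

int main(void)
{
    const unsigned size = 32u << 10, ways = 8, line = 64, page = 4096;

    unsigned sets        = size / ways / line;           /* 64 sets */
    unsigned offset_bits = __builtin_ctz(line);          /* 6  */
    unsigned index_bits  = __builtin_ctz(sets);          /* 6  */
    unsigned page_bits   = __builtin_ctz(page);          /* 12 */

    /* index+offset bits [11:0] lie entirely within the page offset [11:0],
     * so virtual and physical indexing pick the same set: no aliasing.       */
    assert(offset_bits + index_bits <= page_bits);
    printf("sets=%u, index+offset bits=%u, page offset bits=%u -> no aliasing\n",
           sets, offset_bits + index_bits, page_bits);
    return 0;
}
```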

Also see another answer to this question about HT threads on the same core communicating through L1. I said more about cache ways and sets there. (And thanks to Voo, I just corrected it to say that the cache index selects a set, not a way. :P)

