

I want to estimate the performance overhead due to TLB misses on an x86-64 (Intel Nehalem) machine running Linux, and I would like to get this estimate from hardware performance counters. Does anybody have pointers on the best way to do this?




If you can get access to a "Westmere"-based system, the performance characteristics of your code should be quite similar to what you see on "Nehalem", but you will have access to a new hardware performance counter event that measures almost exactly what you want.

On Westmere, the best estimate of performance lost while waiting for TLB misses to be handled is probably from the hardware performance counter Event 08H, Mask 04H "DTLB_LOAD_MISSES.WALK_CYCLES", which is described as counting "Cycles Page Miss Handler is busy with a page walk due to a load miss in the Second Level TLB". This is described in "Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide, Part 2" (document number: 253669), available online at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html
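
As a rough illustration of how that counter could be turned into an overhead estimate, the sketch below divides the DTLB_LOAD_MISSES.WALK_CYCLES count by a total core-cycle count from the same run. The function name and the counter readings are hypothetical, and the total-cycle count is assumed to have been collected alongside the walk-cycle count. The result is best read as an upper bound, since an out-of-order core may overlap part of a page walk with other work.

```python
def tlb_walk_overhead(walk_cycles: int, total_cycles: int) -> float:
    """Fraction of core cycles during which the page miss handler was busy.

    walk_cycles:  reading of DTLB_LOAD_MISSES.WALK_CYCLES (Event 08H, Mask 04H)
    total_cycles: core cycles for the same measurement interval (assumed input)
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return walk_cycles / total_cycles


# Hypothetical counter readings, not measured data.
walk_cycles = 1_200_000_000
total_cycles = 10_000_000_000

print(f"Page-walk share of cycles: {tlb_walk_overhead(walk_cycles, total_cycles):.1%}")
# prints "Page-walk share of cycles: 12.0%"
```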


The reason this event is necessary is that TLB miss processing time is dominated by the time required to read the cache line containing the page table entry. If that cache line is in the L2 cache, then the overhead of a TLB miss will be very small (on the order of 10 cycles). If the line is in the L3 cache, then maybe 25 cycles. If the line is in memory, then ~200 cycles.

  • If there is also a miss in the upper-level page translation caches, it will take multiple trips to memory to find and retrieve the desired page table entry (e.g., https://stackoverflow.com/a/9674980/1264917).
  • On some processors the L2 cache counters can tell you how many table walks hit and missed in the L2, but not on Nehalem. (It would not help a lot in this case since TLB walks that hit in the L3 are also fairly fast and what you really want are the TLB walks that have to go to memory.)
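
To make the latency argument concrete, here is a minimal sketch that weights page walks by where the page table entry was found, using the rough per-walk costs quoted above (10, 25, and ~200 cycles). The breakdown of walks by level is invented for illustration; as noted, Nehalem does not expose such a breakdown. The model also ignores the extra memory trips caused by misses in the upper-level page translation caches.

```python
from typing import Mapping

# Rough per-walk costs in cycles, taken from the estimates in the text.
WALK_COST_CYCLES = {"l2": 10, "l3": 25, "memory": 200}


def estimated_walk_cycles(walks_by_level: Mapping[str, int]) -> int:
    """Total cycles spent on page walks under the simple per-level cost model."""
    return sum(WALK_COST_CYCLES[level] * count
               for level, count in walks_by_level.items())


# Hypothetical distribution: 90% of walks hit in L2, 9% in L3, 1% go to memory.
walks = {"l2": 900_000, "l3": 90_000, "memory": 10_000}

total = estimated_walk_cycles(walks)
memory_share = WALK_COST_CYCLES["memory"] * walks["memory"] / total

print(f"Estimated walk cycles: {total}")
print(f"Share caused by walks that went to memory: {memory_share:.1%}")
```

With these made-up numbers, the 1% of walks that reach memory account for roughly 15% of the estimated cost, which is why a cycle-based event is more informative than a plain count of TLB misses.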

