This article looks at how to use `_mm_prefetch`; it may be a useful reference for anyone facing the same question.

Problem description


The answer What are _mm_prefetch() locality hints? goes into detail on what each hint means.

My question is: which one do I WANT?

I work on a function that is called repeatedly, billions of times, with (among others) an int parameter. The first thing I do is look up a cached value using that parameter (its low 32 bits) as a key into a 4 GB cache. Based on the algorithm this function is called from, I know that most often the key will be doubled (shifted left by 1 bit) from one call to the next, so I am doing:

#include <stdint.h>     // uint8_t
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T2

int foo(int key) {
  uint8_t value = cache[(uint32_t)key];  // low 32 bits of the parameter index the 4 GB cache
  _mm_prefetch((const char *)&cache[(uint32_t)key * 2], _MM_HINT_T2);  // warm the next call's line
  // ...

The goal is to have this value in a processor cache by the next call to this function.

I am looking for confirmation on my understanding of two points:

  1. The call to _mm_prefetch is not going to delay the processing of the instructions immediately following it.
  2. There is no penalty for pre-fetching the wrong location, just a lost benefit from guessing it right.

That function uses a lookup table of 128 128-bit values (2 KB total). Is there a way to "force" it to stay cached? The index into that lookup table is incremented sequentially; should I pre-fetch those entries too? Should I use a different hint, to target a different cache level? What is the best strategy here?

Solution

As I noted in the comments, there's some risk in prefetching the wrong address: a useful line may be evicted from the cache, potentially causing an extra cache miss later.

That said:

_mm_prefetch compiles into the PREFETCHn instruction. I looked up the instruction in the AMD64 Architecture Programmer's Manual published by AMD. (Note that all of this information is necessarily chipset specific; you may need to find your CPU's docs).

AMD says (my emphasis):

What that appears to mean is that if you're running on an AMD CPU, the hint is ignored and the memory is loaded into all levels of the cache -- unless the hint is NTA (Non-Temporal Access, which attempts to load the memory with minimal cache pollution).

Here's the full page for the instruction

I think in the end, the guidance is what the other answer says: brainstorm, implement, test, and measure. You're on the bleeding edge of perf here, and there's not going to be a one-size-fits-all answer.

Another resource that may help you is Agner Fog's Optimization manuals, which will help you optimize for your specific CPU.
