Question
The answer to "What are _mm_prefetch() locality hints?" goes into detail on what each hint means.
My question is: which one do I WANT?
I work on a function that is called repeatedly, billions of times, with an int parameter among others. The first thing I do is look up a cached value, using that parameter (its low 32 bits) as a key into a 4 GB cache. Based on the algorithm that calls this function, I know that the key will most often be doubled (shifted left by 1 bit) from one call to the next, so I am doing:
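    #include <stdint.h>
    #include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T2

    int foo(int key) {
        uint32_t k = (uint32_t)key;  // only the low 32 bits index the cache
        uint8_t value = cache[k];
        _mm_prefetch((const char *)&cache[k * 2u], _MM_HINT_T2);  // likely key of the next call
        // ...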
The goal is to have this value in a processor cache by the next call to this function.
I am looking for confirmation on my understanding of two points:
- The call to _mm_prefetch is not going to delay the processing of the instructions immediately following it.
- There is no penalty for prefetching the wrong location, just a lost benefit from guessing it right.
That function uses a lookup table of 128 128-bit values (2 KB total). Is there a way to "force" it to stay cached? The index into that lookup table is incremented sequentially; should I prefetch those entries too? Should I use a different hint for them, to target a different cache level? What is the best strategy here?
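For illustration, prefetching that table ahead of use might look like the sketch below. It is not taken from the real code: `entry128`, `warm_table`, and the 64-byte `CACHE_LINE` are assumptions made for the example, and the table is assumed to start on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 (x86 only) */

#define CACHE_LINE 64u  /* assumed cache-line size in bytes */

/* Hypothetical 128-bit table entry; 128 of these occupy 2 KB. */
typedef struct { uint64_t lo, hi; } entry128;

/* Issues one T0 prefetch per cache line covering [base, base + bytes).
   Returns the number of prefetches issued. Assumes base is line-aligned. */
static size_t warm_table(const void *base, size_t bytes)
{
    assert(base != NULL);
    const char *p = (const char *)base;
    size_t lines = 0;
    for (size_t off = 0; off < bytes; off += CACHE_LINE) {
        _mm_prefetch(p + off, _MM_HINT_T0);
        lines++;
    }
    return lines;
}
```

For a 2 KB table this issues 32 prefetches. It only requests the lines; it does not pin them. A table this small that is read on every call will usually stay resident through ordinary use, so whether the extra prefetches help at all is something to measure.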
Answer
As I noted in the comments, there is some risk in prefetching the wrong address: a useful line may be evicted from the cache to make room for it, potentially causing a cache miss later.
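A bounds-safe version of the pattern from the question can be sketched as follows. This is a minimal sketch, not the asker's real code: `lookup_and_prefetch`, `TABLE_BITS`, and the small table are hypothetical stand-ins for the 4 GB cache, and the index is masked so the prefetched address always lies inside the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T2 (x86 only) */

/* Hypothetical stand-in for the 4 GB cache: 2^20 one-byte entries. */
#define TABLE_BITS 20u
#define TABLE_SIZE ((size_t)1 << TABLE_BITS)
#define TABLE_MASK ((uint32_t)(TABLE_SIZE - 1))

/* Reads table[key] and hints that table[2 * key] is likely needed next.
   The prefetch is only a hint: it never changes the value returned. */
static uint8_t lookup_and_prefetch(const uint8_t *table, uint32_t key)
{
    assert(table != NULL);
    uint32_t idx  = key & TABLE_MASK;
    uint32_t next = (idx << 1) & TABLE_MASK;  /* predicted key of the next call */
    uint8_t value = table[idx];
    _mm_prefetch((const char *)&table[next], _MM_HINT_T2);
    return value;
}
```

A wrong guess here costs memory bandwidth and possibly one evicted line, but the function's result is the same either way.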
That said, _mm_prefetch compiles into the PREFETCHn instruction. I looked up the instruction in the AMD64 Architecture Programmer's Manual published by AMD. (Note that all of this information is necessarily CPU-specific; you may need to find the documentation for your own CPU.)
AMD says (my emphasis):
What that appears to mean is that if you're running on an AMD CPU, the locality hint is ignored and the memory is loaded into all levels of the cache, unless the hint is NTA (non-temporal access, which attempts to load memory with minimal cache pollution).
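The four hints can be tried side by side with a small wrapper like the one below. This is a sketch for experimentation only: `hint_level` and `prefetch_with` are hypothetical names, and the per-hint comments follow Intel's general description, which individual CPUs are free to implement differently.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants (x86 only) */

/* Hypothetical selector; these names are illustrative, not part of any API. */
typedef enum { HINT_T0, HINT_T1, HINT_T2, HINT_NTA } hint_level;

/* Issues a prefetch for p with the requested hint.
   Returns 1 if a prefetch was issued, 0 for an unknown level.
   The switch exists because the hint argument of _mm_prefetch must be a
   compile-time constant. */
static int prefetch_with(const void *p, hint_level level)
{
    assert(p != NULL);
    switch (level) {
    case HINT_T0:   /* temporal data, all cache levels */
        _mm_prefetch((const char *)p, _MM_HINT_T0);
        return 1;
    case HINT_T1:   /* temporal data, L2 and outward */
        _mm_prefetch((const char *)p, _MM_HINT_T1);
        return 1;
    case HINT_T2:   /* temporal data, L3 and outward */
        _mm_prefetch((const char *)p, _MM_HINT_T2);
        return 1;
    case HINT_NTA:  /* non-temporal, tries to minimize cache pollution */
        _mm_prefetch((const char *)p, _MM_HINT_NTA);
        return 1;
    default:
        return 0;
    }
}
```

If the reading of the AMD manual above is right, the first three cases behave identically on such a CPU, so only T0 versus NTA would be worth comparing there.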
Here's the full page for the instruction
I think in the end, the guidance is what the other answer says: brainstorm, implement, test, and measure. You're on the bleeding edge of performance here, and there's not going to be a one-size-fits-all answer.
Another resource that may help you is Agner Fog's Optimization manuals, which will help you optimize for your specific CPU.
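For the "implement, test, and measure" step, a rough timing harness could look like the sketch below. Everything in it is an assumption made for illustration: `walk`, `time_walk`, the reseeding rule, and the table size are hypothetical, and the loop body is only a placeholder for the real function. With no real work between lookups, the prefetch has little time to complete, so the numbers from this exact loop say little about the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T2 (x86 only) */

/* Walks the table with a mostly-doubling key sequence. When use_prefetch
   is nonzero, the predicted next entry is prefetched after each lookup.
   Returns a checksum so the compiler cannot discard the loads. */
static uint64_t walk(const uint8_t *table, uint32_t mask,
                     uint32_t calls, int use_prefetch)
{
    assert(table != NULL);
    uint64_t sum = 0;
    uint32_t key = 1;
    for (uint32_t i = 0; i < calls; i++) {
        uint32_t idx  = key & mask;
        uint32_t next = (idx * 2u) & mask;   /* the usual case: key doubles */
        sum += table[idx];
        if (use_prefetch)
            _mm_prefetch((const char *)&table[next], _MM_HINT_T2);
        /* Hypothetical reseed when doubling runs out of bits: this is the
           case where the prefetch guessed wrong. */
        key = next ? next : ((i * 2654435761u) | 1u);
    }
    return sum;
}

/* Times one walk with clock(); stores the checksum and returns seconds. */
static double time_walk(const uint8_t *table, uint32_t mask, uint32_t calls,
                        int use_prefetch, uint64_t *checksum)
{
    clock_t t0 = clock();
    *checksum = walk(table, mask, calls, use_prefetch);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling `time_walk` once with `use_prefetch` set to 0 and once with 1, on a table larger than the last-level cache, gives two times to compare; the two checksums should match, which confirms the prefetch changed only timing and not results. Repeating the runs and swapping `_MM_HINT_T2` for the other hints is the cheapest way to see what a particular CPU actually does.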