Problem description
I have read that on x86 and x86-64, Intel's gcc provides special prefetching instructions:
    #include <xmmintrin.h>

    enum _mm_hint
    {
        _MM_HINT_T0 = 3,
        _MM_HINT_T1 = 2,
        _MM_HINT_T2 = 1,
        _MM_HINT_NTA = 0
    };

    void _mm_prefetch(void *p, enum _mm_hint h);
Programs can use the _mm_prefetch intrinsic on any pointer in the program, and the different hints to be used with the _mm_prefetch intrinsic are implementation defined. Generally speaking, each of the hints has its own meaning.
So can someone describe examples of when this instruction is used?
And how to properly choose the hint?
The idea of prefetching is based upon these facts:
- Accessing memory is very expensive the first time. The first time a memory address is accessed it must be fetched from memory; it is then stored in the cache hierarchy.
- Accessing memory is inherently asynchronous. The CPU doesn't need any resource from the core to perform the lengthiest part of a load/store, and thus it can easily be done in parallel with other tasks.
Thanks to the above, it makes sense to start a load before the data is actually needed, so that when the code actually needs it, it won't have to wait.
It is worth noting that the CPU can go pretty far ahead when looking for something to do, but not arbitrarily far; so sometimes it needs the help of the programmer to perform optimally.
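As an illustration, here is a minimal sketch of prefetching ahead of use in a simple array loop; the prefetch distance PF_DIST is a made-up tuning parameter, not a recommended value:

    #include <stddef.h>
    #include <xmmintrin.h>

    /* Prefetch a few elements ahead so the memory transfer overlaps with
       the work on the current elements. */
    #define PF_DIST 16   /* elements ahead; hypothetical, must be tuned */

    double sum_array(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_T0);
            sum += a[i];
        }
        return sum;
    }

Whether this helps at all depends on the hardware prefetchers, which usually handle such a sequential pattern on their own.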
The cache hierarchy is, by its very nature, an aspect of the micro-architecture not the architecture (read ISA). Intel or AMD cannot give strong guarantees on what these instructions do.
Furthermore, using them correctly is not easy, as the programmer must have a clear idea of how many cycles each instruction can take. Finally, the latest CPUs are getting better and better at hiding memory latency and lowering it.
So in general prefetching is a job for the skilled assembly programmer.
That said, the only scenario I can think of is one where the timing of a piece of code must be consistent at every invocation.
For example, if you know that an interrupt handler always updates some state and must run as fast as possible, it is worth prefetching the state variable when setting up the hardware that raises that interrupt.
Regarding the different levels of prefetching, my understanding is that the different levels (L1 - L4) correspond to different amounts of sharing and polluting.
For example prefetcht0 is good if the thread/core that executes the instruction is the same one that will read the variable.
However, this will take a line in all the cache levels, eventually evicting other, possibly useful, lines. You can use this, for example, when you know that you will surely need the data shortly.
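For instance, a sketch of _MM_HINT_T0 in a pointer-chasing loop, with the node layout being purely hypothetical; it only helps if the per-node work is long enough to hide part of the latency:

    #include <xmmintrin.h>

    struct node {
        struct node *next;
        int payload[14];   /* hypothetical per-node data */
    };

    int walk_list(const struct node *n)
    {
        int acc = 0;
        while (n) {
            /* The same core will read the next node soon: T0 pulls it into
               every level, including L1, accepting the pollution. */
            if (n->next)
                _mm_prefetch((const char *)n->next, _MM_HINT_T0);
            acc += n->payload[0];
            n = n->next;
        }
        return acc;
    }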
prefetcht1 is good for making the data quickly available to a core or a group of cores (depending on how L2 is shared) without polluting L1.
You can use this if you know that you may need the data, or that you'll need it after finishing another task (one that takes priority in using the cache).
This is not as fast as having the data in L1 but much better than having it in memory.
prefetcht2 can be used to remove most of the memory-access latency, since it moves the data into the L3 cache.
It doesn't pollute L1 or L2, and since L3 is shared among cores it's good for data used by rare (but possible) code paths or for preparing data for other cores.
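A hedged sketch of that last idea, where fallback_table is a made-up lookup table consulted only on a rare error-handling path:

    #include <stddef.h>
    #include <xmmintrin.h>

    /* Hypothetical table consulted only on a rare fallback path. */
    extern const unsigned char fallback_table[4096];

    void stage_fallback_table(void)
    {
        /* Pull the table into the shared L3 (T2) without displacing the hot
           working set that lives in this core's L1/L2; one prefetch per
           64-byte cache line. */
        for (size_t off = 0; off < sizeof(fallback_table); off += 64)
            _mm_prefetch((const char *)fallback_table + off, _MM_HINT_T2);
    }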
prefetchnta is the easiest to understand: it is a non-temporal move. It avoids creating an entry in every cache level for data that is accessed only once.
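A minimal sketch of streaming once over a large buffer with _MM_HINT_NTA; the two-cache-line prefetch distance is an assumption:

    #include <stddef.h>
    #include <xmmintrin.h>

    unsigned long byte_sum(const unsigned char *buf, size_t len)
    {
        unsigned long sum = 0;
        for (size_t i = 0; i < len; i++) {
            /* Each byte is touched exactly once, so hint the CPU not to keep
               the lines around; stay two cache lines (128 bytes) ahead. */
            if ((i % 64) == 0 && i + 128 < len)
                _mm_prefetch((const char *)buf + i + 128, _MM_HINT_NTA);
            sum += buf[i];
        }
        return sum;
    }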
prefetchw/prefetchwnt1 are like the others but makes the line Exclusive and invalidates other cores lines that alias this one.
Basically, it makes writing faster as it is in the optimal state of the MESI protocol (for cache coherence).
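The _mm_hint values shown in the question don't cover write intent; on GCC and Clang one way to express it is the generic __builtin_prefetch builtin, whose second argument is 1 for a write. A sketch, assuming a counter whose line other cores may hold:

    /* Read-modify-write on a location other cores may have cached.
       rw = 1 asks for the line with write intent (ideally Exclusive),
       locality = 3 asks to keep it in all cache levels. */
    void bump_counter(long *counter)
    {
        __builtin_prefetch(counter, 1, 3);
        /* ... possibly other independent work while the line arrives ... */
        *counter += 1;
    }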
Finally, a prefetch can be done incrementally, first by moving into L3 and then by moving into L1 (just for the threads that need it).
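A sketch of that incremental pattern, with both distances being made-up tuning parameters:

    #include <stddef.h>
    #include <xmmintrin.h>

    #define FAR_DIST  512   /* bytes ahead for the L3 stage (assumption) */
    #define NEAR_DIST 128   /* bytes ahead for the L1 stage (assumption) */

    unsigned process(const char *data, size_t len)
    {
        unsigned acc = 0;
        for (size_t i = 0; i < len; i += 64) {
            if (i + FAR_DIST < len)
                _mm_prefetch(data + i + FAR_DIST, _MM_HINT_T2);   /* stage into L3 early */
            if (i + NEAR_DIST < len)
                _mm_prefetch(data + i + NEAR_DIST, _MM_HINT_T0);  /* promote towards L1 just before use */
            acc += (unsigned char)data[i];                        /* placeholder work on the line */
        }
        return acc;
    }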
In short, each instruction lets you decide the compromise between pollution, sharing and speed of access.
Since these all require keeping track of the use of the cache very carefully (you need to know that it's not worth creating an entry in the L1 but it is in the L2), their use is limited to very specific environments.
In a modern OS it's not really possible to keep track of the cache: you can do a prefetch just to find that your quantum has expired and your program has been replaced by another one that evicts the just-loaded line.
As for a concrete example I'm a bit out of ideas.
In the past, I had to measure the timing of some external event as consistently as possible.
I used an interrupt to periodically monitor the event; in that case I prefetched the variables needed by the interrupt handler, thereby eliminating the latency of the first access.
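A rough sketch of that idea; every name here (handler_state, irq_handler, arm_event) is hypothetical:

    #include <xmmintrin.h>

    /* Hypothetical state touched first by the interrupt handler. */
    struct handler_state {
        unsigned long count;
        unsigned long last_value;
    };

    static struct handler_state hs;

    void irq_handler(void)       /* hypothetical handler */
    {
        hs.count++;              /* first access to hs pays the miss if the line is cold */
    }

    void arm_event(void)         /* hypothetical setup side */
    {
        /* Warm the handler's state just before enabling the interrupt, so the
           handler's first access doesn't pay the memory latency. */
        _mm_prefetch((const char *)&hs, _MM_HINT_T0);
        /* ... unmask the interrupt and wait for the event ... */
    }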
Another, unorthodox, use of prefetching is to move data into the cache on purpose.
This is useful if you want to test the cache system, or to unmap a device from memory while relying on the cache to keep the data around a bit longer.
In this case moving the data into L3 is enough, but not all CPUs have an L3, so we may need to move it into L2 instead.
I understand these examples are not very good, though.
Actually the granularity is "cache lines" not "addresses".
Which I assume you are familiar with. Briefly: at present, the hierarchy goes from L1 to L3/L4. L3/L4 is shared among cores. L1 is always private per core and shared by the core's threads; L2 usually is like L1, but some models may share L2 across pairs of cores.
The lengthiest part is the data transfer from the RAM. Computing the address and initializing the transaction takes up resources (store buffer slots and TLB entries for example).
However, any resource used to access memory can become a critical issue, as pointed out by @Leeor and demonstrated by the Linux kernel developers.