问题描述
我想了解分支预测单元如何在 CPU 中工作.
I am trying to understand how does a branch prediction unit work in a CPU.
我使用了 papi
和 linux 的 perf-events
但它们都没有给出准确的结果(就我而言).
I have used papi
and also linux's perf-events
but both of them do not give accurate results (for my case).
这是我的代码:
void func(int* arr, int sequence_len){
for(int i = 0; i < sequence_len; i++){
// region starts
if(arr[i]){
do_sth();
}
// region ends
}
}
我的数组由 0 和 1 组成.它有一个大小为 sequence_len
的模式.例如,如果我的尺码是 8,那么它的模式是 0 1 0 1 0 0 1 1
或类似的东西.
My array consists of 0's and 1's. It has a pattern with a size of sequence_len
. For example, if my size is 8, then it has a pattern of 0 1 0 1 0 0 1 1
or something like that.
试验 1:
我试图了解 CPU 如何预测这些分支.因此,我使用 papi 并为错误预测的分支预测设置了性能计数器(我知道它也计算间接分支).
I am trying to understand how CPU predicts those branches. So, I have used papi and set up performance counter for branch predictions mispredicted (I know that it also counts indirect branches).
int func(){
papi_read(r1);
for(){
//... same as above
}
papi_read(r2);
return r2-r1;
}
int main(){
init_papi();
for(int i = 0; i < 10; i++)
res[i] = func();
print(res[i]);
}
我看到的输出是(对于 200 的序列长度)
What I see as an output is that (for sequence length of 200)
100 #iter1
40 #iter2
10 #iter3
3
0
0
#...
所以,一开始,CPU 盲目预测序列,成功率只有一半.在接下来的迭代中,CPU 可以预测得越来越好.经过一定数量的迭代后,CPU 可以完美地猜测.
So, at first, the CPU blindly predicts the sequence, only success half of the time. In the next iterations, the CPU can predict better and better. After some amount of iterations, the CPU can guess that perfectly.
试用 2
我想看看,哪个数组索引会导致 CPU 错误预测.
I would like to see, at which array index does the CPU misprediction.
int* func(){
int* results;
for(){
papi_read(r1);
if(arr[i])
do_sth();
papi_read(r2);
res[i] = r2-r1;
}
return res;
}
int main(){
init_papi();
for(int i = 0; i < 10; i++)
res[i] = func();
print(res[i]);
}
预期结果:
#1st iteration, 0 means no mispred, 1 means mispred
1 0 0 1 1 0 0 0 1 1 0... # total of 200 results
Mispred: 100/200
#2nd iteration
0 0 0 0 1 0 0 0 1 0 0... # total of 200 results
Mispred: 40/200 # it learned from previous iteration
#3rd iteration
0 0 0 0 0 0 0 0 1 0 0... # total of 200 results
Mispred: 10/200 # continues to learn
#...
收到的结果:
#1st iteration
1 0 0 1 1 0 0 0 1 1 0... # total of 200 results
Mispred: 100/200
#2nd iteration
1 0 0 0 1 1 0 1 0 0 0... # total of 200 results
Mispred: 100/200 # it DID NOT learn from previous iteration
#3rd iteration
0 1 0 1 0 1 0 1 1 0 0... # total of 200 results
Mispred: 100/200 # NO LEARNING
#...
我的观察
当我在 for 循环之外测量错误预测时,我可以看到 CPU 从它的错误预测中学习.但是,当我尝试测量单个分支指令的错误预测时,CPU 要么无法学习,要么测量错误.
When I measure the misprediction outside of the for loop, I can see that CPU learns from its mispredictions. However, when I try to measure single branch instructions misprediction, then the CPU either cannot learn, or I am measuring it wrongly.
我的解释
我给出了 200 作为序列长度.CPU 有一个小的分支预测器,如 Intel 中的 2-3 位饱和计数器,以及一个大的全局分支预测器.当我在环路外进行测量时,我会在测量中引入较少的噪声.减少噪音是指 papi
调用.
I am giving 200 as a sequence length. The CPU has one small branch predictor, like 2-3 bit saturated counter in Intels, and one big global branch predictor. When I measure outside of the loop, I introduce less noise to the measurement. By less noise, I mean the papi
calls.
考虑一下:环外测量
全局历史是:papi_start, branch_outcome1, branch_outcome2, branch_outcome3, ..., papi_end, papi_start (主迭代的第二个循环), branch_outcome1, ...
因此,分支预测器以某种方式在同一分支中找到了模式.
So, the branch predictor somehow finds the pattern in the same branch.
但是,如果我尝试测量单个分支指令,那么全局历史是:papi_start, branchoutcome1, papiend, papistart, branchoutcome2, papiend...
However, if I try to measure single branch instruction then the global history is:papi_start, branchoutcome1, papiend, papistart, branchoutcome2, papiend...
因此,我正在为全球历史引入越来越多的分支.我假设全局历史不能包含许多分支条目,因此,它无法在所需的 if 语句(分支)中找到任何相关性/模式.
So, I am introducing more and more branches to global history. I assume the global history cannot hold many branch entries and therefore, it cannot find any correlation/pattern in the desired if statement(branch).
结果
我需要测量单个分支预测结果.我知道如果我不过多介绍papi,CPU可以学习200模式.我查看了 papi 调用,并且看到了很多 for 循环,如果条件.
I need to measure a single branch prediction outcome. I know that the CPU can learn the 200 pattern if I don't introduce papi too much. I have looked at the papi calls and I have seen lots of for loops, if conditions.
这就是为什么我需要更好的测量.我尝试过 linux perf-event
但它进行 ioctl
调用,这是一个系统调用,我用系统调用污染了全局历史记录,因此,不是一个好的度量.
That is why I need better measurement. I have tried linux perf-event
but it makes ioctl
calls, which is a system call and I pollute the global history with system calls, and therefore, not a good measurement.
我已经阅读了 rdpmc
和 rdmsr
指令,我假设因为它们只是指令,所以我不会污染全局历史,我可以测量单个分支指令一次.
I have read that rdpmc
and rdmsr
instructions and I assume that since they are only instructions, I will not pollute the global history, and I can measure single branch instruction at a time.
但是,我不知道如何做到这一点.我有 AMD 3600 CPU.这些是我在网上找到的链接,但我不知道如何做到这一点.除此之外,我还遗漏了什么吗?
However, I have no clue about how I can do that. I have AMD 3600 CPU. These are the links that I found online but I couldn't figure out how to do that. In addition to that, am I missing something?
推荐答案
您已经假设 PAPI 和/或 perf_events 代码的占用空间相对较小.这是不正确的.如果您将性能计数器事件更改为指令已停用"或CPU 周期未停止"之类的内容,您将能够看到此操作在您的软件环境中包含多少开销.详细信息将取决于您的操作系统版本,但我预计开销将达到数百条指令/数千个周期,因为读取 perf_events 中的计数器(由 PAPI 使用)所需的内核交叉.代码路径肯定会包含自己的分支.
You have assumed that the PAPI and/or perf_events code has a relatively light footprint. This is incorrect. If you change the performance counter event to something like "instructions retired" or "CPU cycles not halted", you will be able to see how much overhead this operation contains in your software environment. The details will depend on your OS version, but I expect the overhead to be in the hundreds of instructions/thousands of cycles because of the kernel crossing required to read the counters in perf_events (which is used by PAPI). The code path will certainly include its own branches.
如果您的内核支持用户模式 RDPMC"(CR4.PCE=1),您可以使用一条指令读取性能计数器.https://github.com/jdmccalpin/low-overhead-timers 中提供了示例.
If your kernel supports "User-Mode RDPMC" (CR4.PCE=1), you can read a performance counter with a single instruction. Examples are available in https://github.com/jdmccalpin/low-overhead-timers.
即使将测量代码限制为本地 RDPMC 指令(以及用于保存结果的周围代码),测量也会破坏处理器管道.RDPMC 是微编码指令.在 Ryzen 内核上,指令执行 20 个微操作,每 20 个周期具有一条指令的吞吐量.(参考:https://www.agner.org/optimize/instruction_tables.pdf)
Even when limiting the measurement code to the native RDPMC instruction (and the surrounding code to save the results), measurements are disruptive to the processor pipeline. RDPMC is a microcoded instruction. On the Ryzen core, the instruction executes 20 micro-ops and has a throughput of one instruction per 20 cycles. (Ref: https://www.agner.org/optimize/instruction_tables.pdf)
任何细粒度的测量都具有挑战性,因为现代处理器的乱序功能与用户代码的交互方式记录不足且难以预测.有关此主题的更多说明(也与 AMD 处理器相关)位于 http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/一个>
Any measurements at fine granularities are challenging because the out-of-order capabilities of modern processors interact with the user code in ways that are poorly documented and difficult to anticipate. More notes on this topic (also relevant to AMD processors) are at http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/
这篇关于使用 rdmsr/rdpmc 提高分支预测精度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!