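I am using Intel PCM for fine-grained CPU measurements. In my code, I am trying to measure cache efficiency.
Basically, I first bring a small array into the L1 cache (by traversing it many times), then start the timer, traverse the array one more time (which should hit in the cache), and then stop the timer.
PCM reports that my L2 and L3 miss rates are quite high. I also checked with rdtscp, and each array operation takes 15 cycles (much higher than the 4-5 cycles of an L1 cache access).
What I expect is that the array sits entirely in the L1 cache, so the L1, L2 and L3 miss rates should not be high.
My system has 32K, 256K and 25M of L1, L2 and L3 cache, respectively.
Here is my code: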
#include <cstdlib>
#include <iostream>
#include "cpucounters.h" // Intel PCM header

using namespace std;

static const int ARRAY_SIZE = 16;

struct MyStruct {
    struct MyStruct *next;
    long int pad;
}; // each MyStruct is 16 bytes

int main() {
    PCM * m = PCM::getInstance();
    PCM::ErrorCode returnResult = m->program(PCM::DEFAULT_EVENTS, NULL);
    if (returnResult != PCM::Success){
        std::cerr << "Intel's PCM couldn't start" << std::endl;
        exit(1);
    }

    MyStruct *myS = new MyStruct[ARRAY_SIZE];

    // Make a sequential linked list
    for (int i=0; i < ARRAY_SIZE - 1; i++){
        myS[i].next = &myS[i + 1];
        myS[i].pad = (long int) i;
    }
    myS[ARRAY_SIZE - 1].next = NULL;
    myS[ARRAY_SIZE - 1].pad = (long int) (ARRAY_SIZE - 1);

    // Filling the cache
    MyStruct *current;
    for (int i = 0; i < 200000; i++){
        current = &myS[0];
        while ((current = current->next) != NULL)
            current->pad += 1;
    }

    // Sequential access experiment
    current = &myS[0];
    long sum = 0;
    SystemCounterState before = getSystemCounterState();
    while ((current = current->next) != NULL) {
        sum += current->pad;
    }
    SystemCounterState after = getSystemCounterState();

    cout << "Instructions per clock: " << getIPC(before, after) << endl;
    cout << "Cycles per op: " << getCycles(before, after) / ARRAY_SIZE << endl;
    cout << "L2 Misses: " << getL2CacheMisses(before, after) << endl;
    cout << "L2 Hits: " << getL2CacheHits(before, after) << endl;
    cout << "L2 hit ratio: " << getL2CacheHitRatio(before, after) << endl;
    cout << "L3 Misses: " << getL3CacheMisses(before, after) << endl;
    cout << "L3 Hits: " << getL3CacheHits(before, after) << endl;
    cout << "L3 hit ratio: " << getL3CacheHitRatio(before, after) << endl;
    cout << "Sum: " << sum << endl;

    m->cleanup();
    return 0;
}
Here is the output:
Instructions per clock: 0.408456
Cycles per op: 553074
L2 Cache Misses: 58775
L2 Cache Hits: 11371
L2 cache hit ratio: 0.162105
L3 Cache Misses: 24164
L3 Cache Hits: 34611
L3 cache hit ratio: 0.588873
Edit:
I also checked the following code and still got the same miss rates (I would have expected miss rates of almost zero):
SystemCounterState before = getSystemCounterState();
// this is just a comment
SystemCounterState after = getSystemCounterState();
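Edit 2: As one comment suggested, these results might be due to the overhead of the profiler itself. So instead of traversing the array only once, I changed the code to traverse it many times (200,000,000 times) to amortize the profiler's overhead. I still get very low L2 and L3 cache hit ratios (15%).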
Best answer
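It seems that you are getting L2 and L3 misses from all cores on your system.
I looked at the PCM implementation here: https://github.com/erikarn/intel-pcm/blob/ecc0cf608dfd9366f4d2d9fa48dc821af1c26f33/src/cpucounters.cpp
[1] In the implementation of PCM::program() at line 1407, I don't see any code that limits events to a specific process.
[2] In the implementation of PCM::getSystemCounterState() at line 2809, you can see that events are gathered from all cores on the system. So I would try setting the CPU affinity of the process to a single core and then reading events from that core only, using this function: CoreCounterState getCoreCounterState(uint32 core)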