Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs


This article walks through deoptimizing a program for the pipeline on Intel Sandybridge-family CPUs. It should be a useful reference for anyone working on a similar assignment; follow along below.

Problem description
I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions:

The assignment gave a choice of Whetstone or Monte-Carlo programs. The cache-effectiveness comments are mostly only applicable to Whetstone, but I chose the Monte-Carlo simulation program:

// Un-modified baseline for pessimization, as given in the assignment
#include <algorithm>    // Needed for the "max" function
#include <cmath>
#include <iostream>

// A simple implementation of the Box-Muller algorithm, used to generate
// gaussian random numbers - necessary for the Monte Carlo method below
// Note that C++11 actually provides std::normal_distribution<> in
// the <random> library, which can be used instead of this function
double gaussian_box_muller() {
  double x = 0.0;
  double y = 0.0;
  double euclid_sq = 0.0;

  // Continue generating two uniform random variables
  // until the square of their "euclidean distance"
  // is less than unity
  do {
    x = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
    y = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
    euclid_sq = x*x + y*y;
  } while (euclid_sq >= 1.0);

  return x*sqrt(-2*log(euclid_sq)/euclid_sq);
}

// Pricing a European vanilla call option with a Monte Carlo method
double monte_carlo_call_price(const int& num_sims, const double& S, const double& K,
                              const double& r, const double& v, const double& T) {
  double S_adjust = S * exp(T*(r-0.5*v*v));
  double S_cur = 0.0;
  double payoff_sum = 0.0;

  for (int i=0; i<num_sims; i++) {
    double gauss_bm = gaussian_box_muller();
    S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
    payoff_sum += std::max(S_cur - K, 0.0);
  }

  return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}

// Pricing a European vanilla put option with a Monte Carlo method
double monte_carlo_put_price(const int& num_sims, const double& S, const double& K,
                             const double& r, const double& v, const double& T) {
  double S_adjust = S * exp(T*(r-0.5*v*v));
  double S_cur = 0.0;
  double payoff_sum = 0.0;

  for (int i=0; i<num_sims; i++) {
    double gauss_bm = gaussian_box_muller();
    S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
    payoff_sum += std::max(K - S_cur, 0.0);
  }

  return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}

int main(int argc, char **argv) {
  // First we create the parameter list
  int num_sims = 10000000;   // Number of simulated asset paths
  double S = 100.0;  // Option price
  double K = 100.0;  // Strike price
  double r = 0.05;   // Risk-free rate (5%)
  double v = 0.2;    // Volatility of the underlying (20%)
  double T = 1.0;    // One year until expiry

  // Then we calculate the call/put values via Monte Carlo
  double call = monte_carlo_call_price(num_sims, S, K, r, v, T);
  double put = monte_carlo_put_price(num_sims, S, K, r, v, T);

  // Finally we output the parameters and prices
  std::cout << "Number of Paths: " << num_sims << std::endl;
  std::cout << "Underlying:      " << S << std::endl;
  std::cout << "Strike:          " << K << std::endl;
  std::cout << "Risk-Free Rate:  " << r << std::endl;
  std::cout << "Volatility:      " << v << std::endl;
  std::cout << "Maturity:        " << T << std::endl;
  std::cout << "Call Price:      " << call << std::endl;
  std::cout << "Put Price:       " << put << std::endl;

  return 0;
}

The changes I have made seemed to increase the code running time by a second, but I'm not entirely sure what I can change to stall the pipeline without adding code. A point in the right direction would be awesome; I appreciate any responses.

Update: the professor who gave this assignment posted some details. The highlights are:

- It's a second-semester architecture class at a community college (using the Hennessy and Patterson textbook).
- The lab computers have Haswell CPUs.
- The students have been exposed to the CPUID instruction and how to determine cache size, as well as intrinsics and the CLFLUSH instruction.
- Any compiler options are allowed, and so is inline asm.
- Writing your own square-root algorithm was announced as being outside the pale.

Cowmoogun's comments on the meta thread indicate that it wasn't clear compiler optimizations could be part of this, and assumed -O0, and that a 17% increase in run-time was reasonable.

So it sounds like the goal of the assignment was to get students to re-order the existing work to reduce instruction-level parallelism or things like that, but it's not a bad thing that people have delved deeper and learned more.

Keep in mind that this is a computer-architecture question, not a question about how to make C++ slow in general.

Solution

Important background reading: Agner Fog's microarch pdf, and probably also Ulrich Drepper's What Every Programmer Should Know About Memory. See also the other links in the x86 tag wiki, especially Intel's optimization manuals, and David Kanter's analysis of the Haswell microarchitecture, with diagrams.

Very cool assignment; much better than the ones I've seen where students were asked to optimize some code for gcc -O0, learning a bunch of tricks that don't matter in real code. In this case, you're being asked to learn about the CPU pipeline and use that to guide your de-optimization efforts, not just blind guessing. The most fun part of this one is justifying each pessimization with "diabolical incompetence", not intentional malice.

Problems with the assignment wording and code:

The uarch-specific options for this code are limited. It doesn't use any arrays, and much of the cost is calls to exp/log library functions. There isn't an obvious way to have more or less instruction-level parallelism, and the loop-carried dependency chain is very short.

I'd love to see an answer that attempted to get a slowdown from re-arranging the expressions to change the dependencies, to reduce ILP just from dependencies (hazards). I haven't attempted it.
Intel Sandybridge-family CPUs are aggressive out-of-order designs that spend lots of transistors and power to find parallelism and avoid hazards (dependencies) that would trouble a classic RISC in-order pipeline. Usually the only traditional hazards that slow it down are RAW "true" dependencies that cause throughput to be limited by latency.

WAR and WAW hazards for registers are pretty much not an issue, thanks to register renaming (except for popcnt/lzcnt/tzcnt, which have a false dependency on their destination on Intel CPUs, even though it's write-only; i.e. WAW being handled as a RAW hazard + a write). For memory ordering, modern CPUs use store queues to delay commit into cache until retirement, also avoiding WAR and WAW hazards.

The "i7" brand-name was introduced with Nehalem (successor to Core2), and some Intel manuals even say "Core i7" when they seem to mean Nehalem, but they kept the "i7" branding for Sandybridge and later microarchitectures. SnB is when the P6-family evolved into a new species, the SnB-family. In many ways, Nehalem has more in common with Pentium III than with Sandybridge (e.g. register read stalls and ROB-read stalls don't happen on SnB, because it changed to using a physical register file; also a uop cache and a different internal uop format). The term "i7 architecture" is not useful, because it makes no sense to group the SnB-family with Nehalem but not Core2. (Nehalem did introduce the shared inclusive L3 cache architecture for connecting multiple cores together, though. And also integrated GPUs. So chip-level, the naming makes more sense.)

Summary of the good ideas that diabolical incompetence can justify

Even the diabolically incompetent are unlikely to add obviously useless work or an infinite loop, and making a mess with C++/Boost classes is beyond the scope of the assignment.

- Multi-thread with a single shared std::atomic<uint64_t> loop counter, so the right total number of iterations happen. Atomic uint64_t is especially bad with -m32 -march=i586. For bonus points, arrange for it to be misaligned, and crossing a page boundary with an uneven split (not 4:4).
- False sharing for some other non-atomic variable -> memory-order mis-speculation pipeline clears, as well as extra cache misses.
- Instead of using - on FP variables, XOR the high byte with 0x80 to flip the sign bit, causing store-forwarding stalls.
- Time each iteration independently, with something even heavier than RDTSC, e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.
- Change multiplies by constants to divides by their reciprocal ("for ease of reading"). div is slow and not fully pipelined.
- Vectorize the multiply/sqrt with AVX (SIMD), but fail to use vzeroupper before calls to scalar math-library exp() and log() functions, causing AVX<->SSE transition stalls.
- Store the RNG output in a linked list, or in arrays which you traverse out of order. Same for the result of each iteration, and sum at the end.

Also covered in this answer but excluded from the summary: suggestions that would be just as slow on a non-pipelined CPU, or that don't seem to be justifiable even with diabolical incompetence, e.g. many gimp-the-compiler ideas that produce obviously different / worse asm.

Multi-thread badly

Maybe use OpenMP to multi-thread loops with very few iterations, with way more overhead than speed gain. Your monte-carlo code has enough parallelism to actually get a speedup, though, esp. if we succeed at making each iteration slow. (Each thread computes a partial payoff_sum, added at the end.) #omp parallel on that loop would probably be an optimization, not a pessimization.

Multi-thread but force both threads to share the same loop counter (with atomic increments so the total number of iterations is correct). This seems diabolically logical. This means using a static variable as a loop counter. This justifies use of atomic for loop counters, and creates actual cache-line ping-ponging (as long as the threads don't run on the same physical core with hyperthreading; that might not be as slow). Anyway, this is much slower than the un-contended case for lock inc. And lock cmpxchg8b to atomically increment a contended uint64_t on a 32bit system will have to retry in a loop instead of having the hardware arbitrate an atomic inc.

Also create false sharing, where multiple threads keep their private data (e.g. RNG state) in different bytes of the same cache line. (Intel has a tutorial about it, including perf counters to look at.) There's a microarchitecture-specific aspect to this: Intel CPUs speculate on memory mis-ordering not happening, and there's a memory-order machine-clear perf event to detect this, at least on P4. The penalty might not be as large on Haswell. As that link points out, a locked instruction assumes this will happen, avoiding mis-speculation. A normal load speculates that other cores won't invalidate a cache line between when the load executes and when it retires in program-order (unless you use pause). True sharing without locked instructions is usually a bug. It would be interesting to compare a non-atomic shared loop counter with the atomic case. To really pessimize, keep the shared atomic loop counter, and cause false sharing in the same or a different cache line for some other variable.
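To make the bad-threading ideas above concrete, here is a rough sketch of my own (not code from the answer): two threads share one atomic iteration counter, and each thread's "private" RNG seed sits in the same cache line as that counter, so the line both ping-pongs and false-shares. The Shared/worker names and the placeholder LCG are hypothetical; the correct total iteration count is the only diabolical justification.

#include <atomic>
#include <cstdint>
#include <thread>

struct Shared {
    std::atomic<uint64_t> iterations{0};  // contended lock inc (or lock cmpxchg8b with -m32)
    uint32_t rng_seed[2];                 // per-thread state, deliberately in the same cache line
};

void worker(Shared &s, int tid, uint64_t total, double &partial_payoff) {
    // fetch_add returns the old value, so exactly `total` iterations run across both threads
    while (s.iterations.fetch_add(1, std::memory_order_relaxed) < total) {
        // touch the shared line again from each thread (false sharing with the counter)
        s.rng_seed[tid] = s.rng_seed[tid] * 1664525u + 1013904223u;
        partial_payoff += 0.0;  // placeholder for one Monte-Carlo step accumulating into partial_payoff
    }
}

// usage sketch:
//   Shared state{};  double p0 = 0, p1 = 0;
//   std::thread t0(worker, std::ref(state), 0, num_sims, std::ref(p0));
//   std::thread t1(worker, std::ref(state), 1, num_sims, std::ref(p1));
//   t0.join(); t1.join();  double payoff_sum = p0 + p1;

Padding the seeds onto their own cache lines, or giving each thread a local counter, would undo most of the damage, which is exactly why the "incompetent" version keeps everything in one tidy little struct.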
Random uarch-specific ideas:

If you can introduce any unpredictable branches, that will pessimize the code substantially. Modern x86 CPUs have quite long pipelines, so a mispredict costs ~15 cycles (when running from the uop cache).

Dependency chains:

I think this was one of the intended parts of the assignment.

Defeat the CPU's ability to exploit instruction-level parallelism by choosing an order of operations that has one long dependency chain instead of multiple short dependency chains. Compilers aren't allowed to change the order of operations for FP calculations unless you use -ffast-math, because that can change the results (as discussed below).

To really make this effective, increase the length of a loop-carried dependency chain. Nothing leaps out as obvious, though: the loops as written have very short loop-carried dependency chains: just an FP add (3 cycles). Multiple iterations can have their calculations in-flight at once, because they can start well before the payoff_sum += at the end of the previous iteration. (log() and exp take many instructions, but not a lot more than Haswell's out-of-order window for finding parallelism: ROB size = 192 fused-domain uops, and scheduler size = 60 unfused-domain uops. As soon as execution of the current iteration progresses far enough to make room for instructions from the next iteration to issue, any parts of it that have their inputs ready (i.e. an independent/separate dep chain) can start executing when older instructions leave the execution units free, e.g. because they're bottlenecked on latency, not throughput.)

The RNG state will almost certainly be a longer loop-carried dependency chain than the addps.
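As a concrete illustration of lengthening the loop-carried chain (my own sketch, not from the answer), the loop in monte_carlo_call_price can be rewritten so that each iteration's expensive exp/sqrt work depends on the previous iteration's payoff_sum. The `chained` variable and the `* 0.0` term are my own device: adding payoff_sum * 0.0 contributes +0.0 here (the sums are finite and non-negative), so the result is effectively unchanged, but without -ffast-math the compiler must keep both the multiply and the dependency.

for (int i = 0; i < num_sims; i++) {
  double gauss_bm = gaussian_box_muller();
  // Chain this iteration onto the previous one: the argument to exp()
  // now depends on the previous iteration's payoff_sum, so the long
  // exp/sqrt dependency chain can no longer overlap across iterations.
  double chained = gauss_bm + payoff_sum * 0.0;
  S_cur = S_adjust * exp(sqrt(v*v*T) * chained);
  payoff_sum += std::max(S_cur - K, 0.0);
}

The diabolically incompetent justification might be a leftover "correction factor" someone forgot to delete; the effect is that the loop's throughput becomes roughly one add-plus-exp-plus-sqrt latency chain per iteration instead of one 3-cycle add.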
Use slower/more FP operations (esp. more division):

Divide by 2.0 instead of multiplying by 0.5, and so on. FP multiply is heavily pipelined in Intel designs, and has one per 0.5c throughput on Haswell and later. FP divsd/divpd is only partially pipelined. (Although Skylake has an impressive one per 4c throughput for divpd xmm, with 13-14c latency, vs. not pipelined at all on Nehalem (7-22c).)

The do { ...; euclid_sq = x*x + y*y; } while (euclid_sq >= 1.0); is clearly testing for a distance, so clearly it would be proper to sqrt() it. :P (sqrt is even slower than div.)

As @Paul Clayton suggests, rewriting expressions with associative/distributive equivalents can introduce more work (as long as you don't use -ffast-math to allow the compiler to re-optimize). exp(T*(r-0.5*v*v)) could become exp(T*r - T*v*v/2.0). Note that while math on real numbers is associative, floating point math is not, even without considering overflow/NaN (which is why -ffast-math isn't on by default). See Paul's comment for a very hairy nested pow() suggestion.

If you can scale the calculations down to very small numbers, then FP math ops take ~120 extra cycles to trap to microcode when an operation on two normal numbers produces a denormal. See Agner Fog's microarch pdf for the exact numbers and details. This is unlikely since you have a lot of multiplies, so the scale factor would be squared and underflow all the way to 0.0. I don't see any way to justify the necessary scaling with incompetence (even diabolical), only intentional malice.
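A small sketch (mine, not from the answer) of what the division and sqrt ideas above could look like in this program. The sqrt-in-the-loop version is equivalent to the original test except possibly when euclid_sq lands within an ulp of 1.0, and note that a compiler may legally turn /2.0 back into *0.5 because that transformation is exact, so the "divide for readability" trick mostly bites at -O0.

// in monte_carlo_call_price / monte_carlo_put_price:
double S_adjust = S * exp(T * (r - v * v / 2.0));   // divide by 2.0 "for readability"

// in gaussian_box_muller(): test the distance, not the squared distance,
// adding a sqrtsd to every trip through the rejection loop
do {
  x = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
  y = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
  euclid_sq = x*x + y*y;
} while (sqrt(euclid_sq) >= 1.0);

Both changes are easy to defend as "clean-ups", which is the point.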
If you can use intrinsics (<immintrin.h>)

Use movnti to evict your data from cache. Diabolical: it's new and weakly-ordered, so that should let the CPU run it faster, right? Or see that linked question for a case where someone was in danger of doing exactly this (for scattered writes where only some of the locations were hot). clflush is probably impossible without malice.

Use integer shuffles between FP math operations to cause bypass delays.

Mixing SSE and AVX instructions without proper use of vzeroupper causes large stalls in pre-Skylake (and a different penalty in Skylake). Even without that, vectorizing badly can be worse than scalar (more cycles spent shuffling data into/out of vectors than saved by doing the add/sub/mul/div/sqrt operations for 4 Monte-Carlo iterations at once, with 256b vectors). add/sub/mul execution units are fully pipelined and full-width, but div and sqrt on 256b vectors aren't as fast as on 128b vectors (or scalars), so the speedup isn't dramatic for double.

exp() and log() don't have hardware support, so that part would require extracting vector elements back to scalar and calling the library function separately, then shuffling the results back into a vector. libm is typically compiled to only use SSE2, so will use the legacy-SSE encodings of scalar math instructions. If your code uses 256b vectors and calls exp without doing a vzeroupper first, then you stall. After returning, an AVX-128 instruction like vmovsd to set up the next vector element as an arg for exp will also stall. And then exp() will stall again when it runs an SSE instruction. This is exactly what happened in this question, causing a 10x slowdown. (Thanks @ZBoson.)

See also Nathan Kurz's experiments with Intel's math lib vs. glibc for this code. Future glibc will come with vectorized implementations of exp() and so on.

If targeting pre-IvB, or esp. Nehalem, try to get gcc to cause partial-register stalls with 16bit or 8bit operations followed by 32bit or 64bit operations. In most cases, gcc will use movzx after an 8 or 16bit operation, but here's a case where gcc modifies ah and then reads ax.

With (inline) asm:

With (inline) asm, you could break the uop cache: a 32B chunk of code that doesn't fit in three 6-uop cache lines forces a switch from the uop cache to the decoders. An incompetent ALIGN using many single-byte nops instead of a couple long nops on a branch target inside the inner loop might do the trick. Or put the alignment padding after the label, instead of before. :P This only matters if the frontend is a bottleneck, which it won't be if we succeeded at pessimizing the rest of the code.

Use self-modifying code to trigger pipeline clears (aka machine-nukes).

LCP stalls from 16bit instructions with immediates too large to fit in 8 bits are unlikely to be useful. The uop cache on SnB and later means you only pay the decode penalty once. On Nehalem (the first i7), it might work for a loop that doesn't fit in the 28-uop loop buffer. gcc will sometimes generate such instructions, even with -mtune=intel and when it could have used a 32bit instruction.

A common idiom for timing is CPUID (to serialize) then RDTSC. Time every iteration separately with a CPUID/RDTSC to make sure the RDTSC isn't reordered with earlier instructions, which will slow things down a lot. (In real life, the smart way to time is to time all the iterations together, instead of timing each separately and adding them up.)

Cause lots of cache misses and other memory slowdowns

Use a union { double d; char a[8]; } for some of your variables. Cause a store-forwarding stall by doing a narrow store (or Read-Modify-Write) to just one of the bytes. (That wiki article also covers a lot of other microarchitectural stuff for load/store queues.) e.g. flip the sign of a double using XOR 0x80 on just the high byte, instead of a - operator. The diabolically incompetent developer may have heard that FP is slower than integer, and thus try to do as much as possible using integer ops. (A very good compiler targeting FP math in SSE registers may possibly compile this to an xorps with a constant in another xmm register, but the only way this isn't terrible for x87 is if the compiler realizes that it's negating the value and replaces the next add with a subtract.)
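A minimal sketch of that union-based sign flip (my illustration, not code from the answer; the function name is mine), assuming little-endian x86 where the IEEE-754 sign bit is the top bit of byte 7. The byte-sized read-modify-write followed by the full-width reload of u.d defeats store forwarding; as noted above, a good optimizer may see through it and emit an xorps, so it is most reliable at -O0 (or with volatile). Type-punning through a union is well-defined in GNU C++ as an extension even though ISO C++ leaves it undefined.

inline double negate_like_an_integer_fan(double x) {
  union { double d; unsigned char a[8]; } u;
  u.d = x;          // 8-byte store
  u.a[7] ^= 0x80;   // narrow read-modify-write of just the sign byte
  return u.d;       // 8-byte reload overlapping the 1-byte store -> store-forwarding stall
}

// usage sketch: exp(negate_like_an_integer_fan(r) * T) instead of exp(-r * T)
// (flipping the sign before or after the multiply gives the same value)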
Use volatile if you're compiling with -O3 and not using std::atomic, to force the compiler to actually store/reload all over the place. Global variables (instead of locals) will also force some stores/reloads, but the C++ memory model's weak ordering doesn't require the compiler to spill/reload to memory all the time.

Replace local vars with members of a big struct, so you can control the memory layout.

Use arrays in the struct for padding (and storing random numbers, to justify their existence).

Choose your memory layout so everything goes into a different line in the same "set" in the L1 cache. It's only 8-way associative, i.e. each set has 8 "ways". Cache lines are 64B.

Even better, put things exactly 4096B apart, since loads have a false dependency on stores to different pages but with the same offset within a page. Aggressive out-of-order CPUs use Memory Disambiguation to figure out when loads and stores can be reordered without changing the results, and Intel's implementation has false positives that prevent loads from starting early. Probably they only check bits below the page offset, so the check can start before the TLB has translated the high bits from a virtual page to a physical page. As well as Agner's guide, see an answer from Stephen Canon, and also a section near the end of @Krazy Glew's answer on the same question. (Andy Glew was one of the architects of Intel's original P6 microarchitecture.)

Use __attribute__((packed)) to let you mis-align variables so they span cache-line or even page boundaries. (So a load of one double needs data from two cache lines.) Misaligned loads have no penalty in any Intel i7 uarch, except when crossing cache lines and pages. Cache-line splits still take extra cycles. Skylake dramatically reduces the penalty for page-split loads, from 100 to 5 cycles (Section 2.1.3), perhaps related to being able to do two page walks in parallel.

A page-split on an atomic<uint64_t> should be just about the worst case, esp. if it's 5 bytes in one page and 3 bytes in the other page, or anything other than 4:4. Even splits down the middle are more efficient for cache-line splits with 16B vectors on some uarches, IIRC. Put everything in an alignas(4096) struct __attribute((packed)) (to save space, of course), including an array for storage for the RNG results. Achieve the misalignment by using uint8_t or uint16_t for something before the counter.

If you can get the compiler to use indexed addressing modes, that will defeat uop micro-fusion. Maybe by using #defines to replace simple scalar variables with my_data[constant].

If you can introduce an extra level of indirection, so load/store addresses aren't known early, that can pessimize further.

Traverse arrays in non-contiguous order

I think we can come up with incompetent justification for introducing an array in the first place: it lets us separate the random number generation from the random number use. Results of each iteration could also be stored in an array, to be summed later (with more diabolical incompetence).

For "maximum randomness", we could have a thread looping over the random array writing new random numbers into it. The thread consuming the random numbers could generate a random index to load a random number from. (There's some make-work here, but microarchitecturally it helps for load-addresses to be known early so any possible load latency can be resolved before the loaded data is needed.) Having a reader and writer on different cores will cause memory-ordering mis-speculation pipeline clears (as discussed earlier for the false-sharing case).

For maximum pessimization, loop over your array with a stride of 4096 bytes (i.e. 512 doubles), e.g.

for (int i=0 ; i<512; i++)
    for (int j=i ; j<UPPER_BOUND ; j+=512)
        monte_carlo_step(rng_array[j]);

So the access pattern is 0, 4096, 8192, ..., then 8, 4104, 8200, ..., then 16, 4112, 8208, ...

This is what you'd get for accessing a 2D array like double rng_array[MAX_ROWS][512] in the wrong order (looping over rows, instead of columns within a row in the inner loop, as suggested by @JesperJuhl). If diabolical incompetence can justify a 2D array with dimensions like that, garden variety real-world incompetence easily justifies looping with the wrong access pattern. This happens in real code in real life.

Adjust the loop bounds if necessary to use many different pages instead of reusing the same few pages, if the array isn't that big. Hardware prefetching doesn't work (as well/at all) across pages. The prefetcher can track one forward and one backward stream within each page (which is what happens here), but will only act on it if the memory bandwidth isn't already saturated with non-prefetch.
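Putting several of the layout ideas above into one place, here is a hypothetical struct of my own (names and sizes invented, not from the answer). The leading uint8_t knocks every later member off its natural alignment, so one in eight of the doubles straddles a cache line and one per page straddles a page boundary; the two 512-element arrays put element i of each at the same offset within its page, which is what the 4096-byte false-dependency trick wants; and the arrays "justify themselves" by storing RNG output and per-iteration results for later summing.

#include <cstdint>

struct __attribute__((packed, aligned(4096))) SimState {
  uint8_t  version_tag;        // 1 byte of "header": everything below is now misaligned
  uint64_t iter_count;         // make this std::atomic<uint64_t> and nudge it onto a page
                               // boundary for the worst case described above (compilers
                               // may warn about under-aligned atomics)
  double   rng_history[512];   // 4096 bytes of stored Box-Muller output
  double   payoff_terms[512];  // same offsets within the next page as rng_history
};

static SimState g_state;       // one global instance controls the whole memory layout

Combined with the out-of-order traversal above (or an extra level of indirection to hide the addresses), this one struct can supply misaligned loads, cache-line and page splits, and 4K aliasing all at once.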
This will also generate lots of TLB misses, unless the pages get merged into a hugepage (Linux does this opportunistically for anonymous (not file-backed) allocations like malloc/new that use mmap(MAP_ANONYMOUS)).

Instead of an array to store the list of results, you could use a linked list. Then every iteration would require a pointer-chasing load (a RAW true dependency hazard for the load-address of the next load). With a bad allocator, you might manage to scatter the list nodes around in memory, defeating cache. With a diabolically incompetent allocator, it could put every node at the beginning of its own page (e.g. allocate with mmap(MAP_ANONYMOUS) directly, without breaking up pages or tracking object sizes to properly support free).

These aren't really microarchitecture-specific, and have little to do with the pipeline (most of these would also be a slowdown on a non-pipelined CPU).

Somewhat off-topic: make the compiler generate worse code / do more work:

Use C++11 std::atomic<int> and std::atomic<double> for the most pessimal code. The MFENCEs and locked instructions are quite slow even without contention from another thread.

-m32 will make slower code, because x87 code will be worse than SSE2 code. The stack-based 32bit calling convention takes more instructions, and passes even FP args on the stack to functions like exp(). atomic<uint64_t>::operator++ on -m32 requires a lock cmpxchg8B loop (i586). (So use that for loop counters! [Evil laugh])

-march=i386 will also pessimize (thanks @Jesper). FP compares with fcom are slower than 686 fcomi. Pre-586 doesn't provide an atomic 64bit store (let alone a cmpxchg), so all 64bit atomic ops compile to libgcc function calls (which is probably compiled for i686, rather than actually using a lock). Try it on the Godbolt Compiler Explorer link in the last paragraph.

Use long double / sqrtl / expl for extra precision and extra slowness in ABIs where sizeof(long double) is 10 or 16 (with padding for alignment). (IIRC, 64bit Windows uses 8-byte long double equivalent to double. Anyway, load/store of 10-byte (80-bit) FP operands is 4 / 7 uops, vs. float or double only taking 1 uop each for fld m64/m32/fst.) Forcing x87 with long double defeats auto-vectorization even for gcc -m64 -march=haswell -O3.

If not using atomic<uint64_t> loop counters, use long double for everything, including loop counters.

atomic<double> compiles, but read-modify-write operations like += aren't supported for it (even on 64bit). atomic<long double> has to call a library function just for atomic loads/stores. It's probably really inefficient, because the x86 ISA doesn't naturally support atomic 10-byte loads/stores, and the only way I can think of without locking (cmpxchg16b) requires 64bit mode.

At -O0, breaking up a big expression by assigning parts to temporary vars will cause more store/reloads. Without volatile or something, this won't matter with optimization settings that a real build of real code would use.

C aliasing rules allow a char to alias anything, so storing through a char* forces the compiler to store/reload everything before/after the byte-store, even at -O3. (This is a problem for auto-vectorizing code that operates on an array of uint8_t, for example.)
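A short sketch (mine, not from the answer; the g_ names are invented) combining a few of these compiler-level pessimizations: a volatile long double accumulator forces x87 code and a store plus reload on every update, the long double overloads of std::exp/std::sqrt are the expl/sqrtl paths mentioned above, and the global std::atomic<uint64_t> counter becomes a lock cmpxchg8b loop if you build with -m32. Note that the extra x87 precision changes the printed results in the last digits, which may or may not be acceptable under the assignment's "same output" expectation.

#include <algorithm>
#include <atomic>
#include <cmath>
#include <cstdint>

double gaussian_box_muller();                // from the baseline code above

std::atomic<uint64_t> g_iterations{0};       // with -m32: lock cmpxchg8b per increment
volatile long double g_payoff_sum = 0.0L;    // volatile: stored and reloaded on every update

// one iteration of the pricing loop, made deliberately heavy
void accumulate_one_path(double S_adjust, double v, double T, double K) {
  long double gauss_bm = gaussian_box_muller();
  long double S_cur = S_adjust * std::exp(std::sqrt((long double)v * v * T) * gauss_bm);
  g_payoff_sum = g_payoff_sum + std::max<long double>(S_cur - K, 0.0L);
  ++g_iterations;
}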
Try uint16_t loop counters, to force truncation to 16bit, probably by using 16bit operand-size (potential stalls) and/or extra movzx instructions (safe). Signed overflow is undefined behaviour, so unless you use -fwrapv or at least -fno-strict-overflow, signed loop counters don't have to be re-sign-extended every iteration, even if used as offsets to 64bit pointers.

Force conversion from integer to float and back again. And/or double<=>float conversions. The instructions have greater-than-one latency, and scalar int->float (cvtsi2ss) is badly designed to not zero the rest of the xmm register. (gcc inserts an extra pxor to break dependencies, for this reason.)

Frequently set your CPU affinity to a different CPU (suggested by @Egwor). Diabolical reasoning: you don't want one core to get overheated from running your thread for a long time, do you? Maybe swapping to another core will let that core turbo to a higher clock speed. (In reality: they're so thermally close to each other that this is highly unlikely except in a multi-socket system.) Now just get the tuning wrong and do it way too often. Besides the time spent in the OS saving/restoring thread state, the new core has cold L2/L1 caches, uop cache, and branch predictors.

Introducing frequent unnecessary system calls can slow you down no matter what they are, although some important but simple ones like gettimeofday may be implemented in user-space, with no transition to kernel mode. (glibc on Linux does this with the kernel's help, since the kernel exports code in the vdso.)

For more on system call overhead (including cache/TLB misses after returning to user-space, not just the context switch itself), the FlexSC paper has some great perf-counter analysis of the current situation, as well as a proposal for batching system calls from massively multi-threaded server processes.
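To make the last two points concrete, here is a hypothetical helper of my own (not from the answer), assuming Linux with glibc and at least ncores online CPUs: call it once per iteration "so that no single core overheats". Every move lands the thread on cold L1/L2 caches, a cold uop cache and cold branch predictors, and both calls pay the full user/kernel round trip (unlike vdso-backed calls such as gettimeofday).

#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

// rotate the calling thread to a different core and make a pointless syscall
void rotate_core_for_thermal_reasons(long iter, int ncores) {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(static_cast<int>(iter % ncores), &mask);
  sched_setaffinity(0, sizeof(mask), &mask);   // pid 0 = the calling thread
  syscall(SYS_getpid);                         // guaranteed trip into the kernel
}

Called every iteration, the OS-level migration cost alone dwarfs the Monte-Carlo work, which is exactly the kind of "sensible-sounding" overhead the answer warns about.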