This post looks at using enhanced REP MOVSB (ERMSB) for memcpy. It should be a useful reference for anyone working on the same problem; the original question and the answer follow below.

Problem description
I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy. ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and STOSB operation (ERMSB)" in the Intel optimization manual if you don't know what ERMSB is.

The only way I know to do this directly is with inline assembly.
I got the following function from https://groups.google.com/forum/#!topic/gnu.gcc.help/-Bmlm_EG_fE

    static inline void *__movsb(void *d, const void *s, size_t n) {
      asm volatile ("rep movsb"
                    : "=D" (d), "=S" (s), "=c" (n)
                    : "0" (d), "1" (s), "2" (n)
                    : "memory");
      return d;
    }

When I use this, however, the bandwidth is much less than with memcpy: __movsb gets 15 GB/s and memcpy gets 26 GB/s with my i7-6700HQ (Skylake) system, Ubuntu 16.10, DDR4 @ 2400 MHz dual channel 32 GB, GCC 6.2.

Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?

Here is the code I used to test this.

    //gcc -O3 -march=native -fopenmp foo.c
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>
    #include <stddef.h>
    #include <omp.h>
    #include <x86intrin.h>

    static inline void *__movsb(void *d, const void *s, size_t n) {
      asm volatile ("rep movsb"
                    : "=D" (d), "=S" (s), "=c" (n)
                    : "0" (d), "1" (s), "2" (n)
                    : "memory");
      return d;
    }

    int main(void) {
      int n = 1<<30;

      //char *a = malloc(n), *b = malloc(n);
      char *a = _mm_malloc(n,4096), *b = _mm_malloc(n,4096);
      memset(a,2,n), memset(b,1,n);
      __movsb(b,a,n);
      printf("%d\n", memcmp(b,a,n));

      double dtime;

      dtime = -omp_get_wtime();
      for(int i=0; i<10; i++) __movsb(b,a,n);
      dtime += omp_get_wtime();
      printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);

      dtime = -omp_get_wtime();
      for(int i=0; i<10; i++) memcpy(b,a,n);
      dtime += omp_get_wtime();
      printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);
    }

The reason I am interested in rep movsb is based on these comments (from "What's missing/sub-optimal in this memcpy implementation?"):

    Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs a RFO into LLC, rep movsb does not... rep movsb is significantly faster than movntdqa when streaming to memory on Ivybridge and Haswell (but be aware that pre-Ivybridge it is slow!)

Here are my results on the same system from tinymembench.

    C copy backwards : 7910.6 MB/s (1.4%)
    C copy backwards (32 byte blocks) : 7696.6 MB/s (0.9%)
    C copy backwards (64 byte blocks) : 7679.5 MB/s (0.7%)
    C copy : 8811.0 MB/s (1.2%)
    C copy prefetched (32 bytes step) : 9328.4 MB/s (0.5%)
    C copy prefetched (64 bytes step) : 9355.1 MB/s (0.6%)
    C 2-pass copy : 6474.3 MB/s (1.3%)
    C 2-pass copy prefetched (32 bytes step) : 7072.9 MB/s (1.2%)
    C 2-pass copy prefetched (64 bytes step) : 7065.2 MB/s (0.8%)
    C fill : 14426.0 MB/s (1.5%)
    C fill (shuffle within 16 byte blocks) : 14198.0 MB/s (1.1%)
    C fill (shuffle within 32 byte blocks) : 14422.0 MB/s (1.7%)
    C fill (shuffle within 64 byte blocks) : 14178.3 MB/s (1.0%)
    ---
    standard memcpy : 12784.4 MB/s (1.9%)
    standard memset : 30630.3 MB/s (1.1%)
    ---
    MOVSB copy : 8712.0 MB/s (2.0%)
    MOVSD copy : 8712.7 MB/s (1.9%)
    SSE2 copy : 8952.2 MB/s (0.7%)
    SSE2 nontemporal copy : 12538.2 MB/s (0.8%)
    SSE2 copy prefetched (32 bytes step) : 9553.6 MB/s (0.8%)
    SSE2 copy prefetched (64 bytes step) : 9458.5 MB/s (0.5%)
    SSE2 nontemporal copy prefetched (32 bytes step) : 13103.2 MB/s (0.7%)
    SSE2 nontemporal copy prefetched (64 bytes step) : 13179.1 MB/s (0.9%)
    SSE2 2-pass copy : 7250.6 MB/s (0.7%)
    SSE2 2-pass copy prefetched (32 bytes step) : 7437.8 MB/s (0.6%)
    SSE2 2-pass copy prefetched (64 bytes step) : 7498.2 MB/s (0.9%)
    SSE2 2-pass nontemporal copy : 3776.6 MB/s (1.4%)
    SSE2 fill : 14701.3 MB/s (1.6%)
    SSE2 nontemporal fill : 34188.3 MB/s (0.8%)

Note that on my system SSE2 copy prefetched is also faster than MOVSB copy.
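In case the tinymembench labels are unfamiliar: the "C copy" rows above correspond to a plain word-at-a-time loop, roughly like the sketch below (my own illustration, not the benchmark's exact source), while the "SSE2" and "MOVSB/MOVSD" rows use 128-bit vector moves and rep string instructions respectively.

    #include <stdint.h>
    #include <stddef.h>

    // Roughly what a "C copy" measures: one 64-bit word per iteration,
    // no SIMD, no prefetch hints, no non-temporal stores.
    static void copy_c_words(int64_t *dst, const int64_t *src, size_t bytes) {
        for (size_t i = 0; i < bytes / sizeof(int64_t); i++)
            dst[i] = src[i];
    }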
In my original tests I did not disable turbo. I disabled turbo and tested again and it does not appear to make much of a difference. However, changing the power management does make a big difference.

When I do

    sudo cpufreq-set -r -g performance

I sometimes see over 20 GB/s with rep movsb. With

    sudo cpufreq-set -r -g powersave

the best I see is about 17 GB/s. But memcpy does not seem to be sensitive to the power management.

I checked the frequency (using turbostat) with and without SpeedStep enabled, with performance and with powersave, for idle, a 1 core load and a 4 core load. I ran Intel's MKL dense matrix multiplication to create a load and set the number of threads using OMP_SET_NUM_THREADS. Here is a table of the results (numbers in GHz).

                  SpeedStep   idle   1 core   4 core
    powersave     OFF         0.8    2.6      2.6
    performance   OFF         2.6    2.6      2.6
    powersave     ON          0.8    3.5      3.1
    performance   ON          3.5    3.5      3.1

This shows that with powersave, even with SpeedStep disabled, the CPU still clocks down to the idle frequency of 0.8 GHz. It's only with performance without SpeedStep that the CPU runs at a constant frequency.

I used e.g. sudo cpufreq-set -r performance (because cpufreq-set was giving strange results) to change the power settings. This turns turbo back on, so I had to disable turbo afterwards.

Solution

This is a topic pretty near to my heart and recent investigations, so I'll look at it from a few angles: history, some technical notes (mostly academic), test results on my box, and finally an attempt to answer your actual question of when and where rep movsb might make sense.

Partly, this is a call to share results - if you can run tinymembench and share the results along with details of your CPU and RAM configuration, it would be great. Especially if you have a 4-channel setup, an Ivy Bridge box, a server box, etc.

History and Official Advice

The performance history of the fast string copy instructions has been a bit of a stair-step affair - i.e., periods of stagnant performance alternating with big upgrades that brought them into line with or even faster than competing approaches. For example, there was a jump in performance in Nehalem (mostly targeting startup overheads) and again in Ivy Bridge (mostly targeting total throughput for large copies). You can find decade-old insight on the difficulties of implementing the rep movs instructions from an Intel engineer in this thread.

For example, in guides preceding the introduction of Ivy Bridge, the typical advice was to avoid them or use them very carefully1.

The current (well, June 2016) guide has a variety of confusing and somewhat inconsistent advice, such as2:

    The specific variant of the implementation is chosen at execution time based on data layout, alignment and the counter (ECX) value. For example, MOVSB/STOSB with the REP prefix should be used with counter value less than or equal to three for best performance.

So for copies of 3 or fewer bytes? You don't need a rep prefix for that in the first place, since with a claimed startup latency of ~9 cycles you are almost certainly better off with a simple DWORD or QWORD mov with a bit of bit-twiddling to mask off the unused bytes (or perhaps with an explicit byte mov plus a word mov if you know the size is exactly three).
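As a concrete illustration of that point (my own sketch, not code from the Intel guide): a copy whose length is known to be at most three bytes can be done with a word mov plus a byte mov and no rep prefix at all, sidestepping the rep startup cost entirely.

    #include <string.h>
    #include <stddef.h>

    // Copy exactly n bytes for n in {1, 2, 3} using plain moves.
    // The fixed-size memcpy is just a portable unaligned 16-bit load/store;
    // compilers lower it to a single word mov.
    static inline void copy_1_to_3(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        if (n >= 2)
            memcpy(d, s, 2);        // word mov covering bytes [0, 2)
        if (n & 1)
            d[n - 1] = s[n - 1];    // byte mov for the remaining odd byte
    }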
They go on to say:

    String MOVE/STORE instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable. This means better efficiency can be achieved by decomposing an arbitrary counter value into a number of double words plus single byte moves with a count value less than or equal to 3.

This certainly seems wrong on current hardware with ERMSB, where rep movsb is at least as fast as, or faster than, the movd or movq variants for large copies.

In general, that section (3.7.5) of the current guide contains a mix of reasonable and badly obsolete advice. This is common throughout the Intel manuals, since they are updated in an incremental fashion for each architecture (and purport to cover nearly two decades' worth of architectures even in the current manual), and old sections are often not updated to replace or make conditional advice that doesn't apply to the current architecture.

They then go on to cover ERMSB explicitly in section 3.7.6.

I won't go over the remaining advice exhaustively, but I'll summarize the good parts in the "why use it" below.

Other important claims from the guide are that on Haswell, rep movsb has been enhanced to use 256-bit operations internally.

Technical Considerations

This is just a quick summary of the underlying advantages and disadvantages that the rep instructions have from an implementation standpoint.

Advantages for rep movs

When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:

- Avoiding the RFO request when it knows the entire cache line will be overwritten.
- Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
- Apparently, there is no guarantee of ordering among the stores within3 a single rep movs, which can help simplify coherency traffic and other aspects of the block move, versus simple mov instructions which have to obey rather strict memory ordering4.
- In principle, the rep movs instruction could take advantage of various architectural tricks that aren't exposed in the ISA. For example, architectures may have wider internal data paths than the ISA exposes5, and rep movs could use them internally.

Disadvantages

rep movsb must implement a specific semantic which may be stronger than the underlying software requirement. In particular, memcpy forbids overlapping regions, and so an implementation may ignore that possibility, but rep movsb allows them and must produce the expected result. On current implementations this mostly affects startup overhead, but probably not large-block throughput. Similarly, rep movsb must support byte-granular copies even if you are actually using it to copy large blocks which are a multiple of some large power of 2.

The software may have information about alignment, copy size and possible aliasing that cannot be communicated to the hardware if it uses rep movsb. Compilers can often determine the alignment of memory blocks6 and so can avoid much of the startup work that rep movs must do on every invocation.
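For example (a hedged sketch of my own, not something from the answer): when both the size and the alignment are visible at compile time, the compiler can expand the copy inline with no runtime dispatch at all, which is exactly the kind of startup work rep movsb has to redo on every call.

    #include <string.h>

    // The 64-byte alignment promise and the constant size are both visible to
    // the compiler, so GCC/Clang will normally inline this as a short sequence
    // of vector moves rather than calling memcpy or using rep movsb.
    void copy_256(void *dst, const void *src) {
        void *d = __builtin_assume_aligned(dst, 64);
        const void *s = __builtin_assume_aligned(src, 64);
        memcpy(d, s, 256);   // constant size: typically expanded inline
    }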
Test Results

Here are test results for many different copy methods from tinymembench on my i7-6700HQ at 2.6 GHz (too bad I have the identical CPU so we aren't getting a new data point...):

    C copy backwards : 8284.8 MB/s (0.3%)
    C copy backwards (32 byte blocks) : 8273.9 MB/s (0.4%)
    C copy backwards (64 byte blocks) : 8321.9 MB/s (0.8%)
    C copy : 8863.1 MB/s (0.3%)
    C copy prefetched (32 bytes step) : 8900.8 MB/s (0.3%)
    C copy prefetched (64 bytes step) : 8817.5 MB/s (0.5%)
    C 2-pass copy : 6492.3 MB/s (0.3%)
    C 2-pass copy prefetched (32 bytes step) : 6516.0 MB/s (2.4%)
    C 2-pass copy prefetched (64 bytes step) : 6520.5 MB/s (1.2%)
    ---
    standard memcpy : 12169.8 MB/s (3.4%)
    standard memset : 23479.9 MB/s (4.2%)
    ---
    MOVSB copy : 10197.7 MB/s (1.6%)
    MOVSD copy : 10177.6 MB/s (1.6%)
    SSE2 copy : 8973.3 MB/s (2.5%)
    SSE2 nontemporal copy : 12924.0 MB/s (1.7%)
    SSE2 copy prefetched (32 bytes step) : 9014.2 MB/s (2.7%)
    SSE2 copy prefetched (64 bytes step) : 8964.5 MB/s (2.3%)
    SSE2 nontemporal copy prefetched (32 bytes step) : 11777.2 MB/s (5.6%)
    SSE2 nontemporal copy prefetched (64 bytes step) : 11826.8 MB/s (3.2%)
    SSE2 2-pass copy : 7529.5 MB/s (1.8%)
    SSE2 2-pass copy prefetched (32 bytes step) : 7122.5 MB/s (1.0%)
    SSE2 2-pass copy prefetched (64 bytes step) : 7214.9 MB/s (1.4%)
    SSE2 2-pass nontemporal copy : 4987.0 MB/s

Some key takeaways:

- The rep movs methods are faster than all the other methods which aren't "non-temporal"7, and considerably faster than the "C" approaches which copy 8 bytes at a time.
- The "non-temporal" methods are faster, by up to about 26%, than the rep movs ones - but that's a much smaller delta than the one you reported (26 GB/s vs 15 GB/s = ~73%). (A sketch of such a non-temporal copy loop follows after this list.)
- If you are not using non-temporal stores, using 8-byte copies from C is pretty much just as good as 128-bit wide SSE loads/stores. That's because a good copy loop can generate enough memory pressure to saturate the bandwidth (e.g., 2.6 GHz * 1 store/cycle * 8 bytes = 26 GB/s for stores).
- There are no explicit 256-bit algorithms in tinymembench (except probably the "standard" memcpy), but it probably doesn't matter due to the above note.
- The increased throughput of the non-temporal store approaches over the temporal ones is about 1.45x, which is very close to the 1.5x you would expect if NT eliminates 1 out of 3 transfers (i.e., 1 read, 1 write for NT vs 2 reads, 1 write). The rep movs approaches lie in the middle.
- The combination of fairly low memory latency and modest 2-channel bandwidth means this particular chip happens to be able to saturate its memory bandwidth from a single thread, which changes the behavior dramatically.
- rep movsd seems to use the same magic as rep movsb on this chip. That's interesting because ERMSB only explicitly targets movsb, and earlier tests on earlier archs with ERMSB show movsb performing much faster than movsd. This is mostly academic since movsb is more general than movsd anyway.
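Since several of these points hinge on the difference between temporal and non-temporal stores, here is a minimal sketch of what an "SSE2 nontemporal copy" loop looks like (my own code, not tinymembench's source). The movntdq stores write around the cache, which is what avoids the read-for-ownership traffic on the destination discussed further below. It assumes 16-byte-aligned buffers and a size that is a multiple of 64 bytes.

    #include <emmintrin.h>
    #include <stddef.h>

    static void copy_nt_sse2(void *dst, const void *src, size_t n) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < n / 16; i += 4) {
            __m128i a = _mm_load_si128(s + i + 0);
            __m128i b = _mm_load_si128(s + i + 1);
            __m128i c = _mm_load_si128(s + i + 2);
            __m128i e = _mm_load_si128(s + i + 3);
            _mm_stream_si128(d + i + 0, a);   // movntdq: bypasses the cache
            _mm_stream_si128(d + i + 1, b);
            _mm_stream_si128(d + i + 2, c);
            _mm_stream_si128(d + i + 3, e);
        }
        _mm_sfence();   // order the weakly-ordered NT stores before returning
    }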
Haswell

Looking at the Haswell results kindly provided by iwillnotexist in the comments, we see the same general trends (most relevant results extracted):

    C copy : 6777.8 MB/s (0.4%)
    standard memcpy : 10487.3 MB/s (0.5%)
    MOVSB copy : 9393.9 MB/s (0.2%)
    MOVSD copy : 9155.0 MB/s (1.6%)
    SSE2 copy : 6780.5 MB/s (0.4%)
    SSE2 nontemporal copy : 10688.2 MB/s (0.3%)

The rep movsb approach is still slower than the non-temporal memcpy, but only by about 14% here (compared to ~26% in the Skylake test). The advantage of the NT techniques over their temporal cousins is now ~57%, even a bit more than the theoretical benefit of the bandwidth reduction.

When should you use rep movs?

Finally, a stab at your actual question: when or why should you use it? It draws on the above and introduces a few new ideas. Unfortunately there is no simple answer: you'll have to trade off various factors, including some which you probably can't even know exactly, such as future developments.

Note that the alternative to rep movsb may be the optimized libc memcpy (including copies inlined by the compiler), or it may be a hand-rolled memcpy version. Some of the benefits below apply only in comparison to one or the other of these alternatives (e.g., "simplicity" helps against a hand-rolled version, but not against built-in memcpy), but some apply to both.

Restrictions on available instructions

In some environments there is a restriction on certain instructions or on using certain registers. For example, in the Linux kernel, use of SSE/AVX or FP registers is generally disallowed. Therefore most of the optimized memcpy variants cannot be used, as they rely on SSE or AVX registers, and a plain 64-bit mov-based copy is used on x86. For these platforms, using rep movsb allows most of the performance of an optimized memcpy without breaking the restriction on SIMD code.

A more general example might be code that has to target many generations of hardware, and which doesn't use hardware-specific dispatching (e.g., using cpuid). Here you might be forced to use only older instruction sets, which rules out any AVX, etc. rep movsb might be a good approach here since it allows "hidden" access to wider loads and stores without using new instructions. If you target pre-ERMSB hardware you'd have to see if rep movsb performance is acceptable there, though...

Future Proofing

A nice aspect of rep movsb is that it can, in theory, take advantage of architectural improvements on future architectures, without source changes, in a way that explicit moves cannot. For example, when 256-bit data paths were introduced, rep movsb was able to take advantage of them (as claimed by Intel) without any changes needed to the software. Software using 128-bit moves (which was optimal prior to Haswell) would have to be modified and recompiled.

So it is both a software maintenance benefit (no need to change source) and a benefit for existing binaries (no need to deploy new binaries to take advantage of the improvement).

How important this is depends on your maintenance model (e.g., how often new binaries are deployed in practice) and on the very difficult judgement of how fast these instructions are likely to be in the future. At least Intel is kind of guiding users in this direction, though, by committing to at least reasonable performance in the future (15.3.3.6):

    REP MOVSB and REP STOSB will continue to perform reasonably well on future processors.

Overlapping with subsequent work

This benefit won't show up in a plain memcpy benchmark, of course, which by definition doesn't have subsequent work to overlap, so the magnitude of the benefit would have to be carefully measured in a real-world scenario. Taking maximum advantage might require re-organization of the code surrounding the memcpy.

This benefit is pointed out by Intel in their optimization manual (section 11.16.3.4) and in their words:

    When the count is known to be at least a thousand byte or more, using enhanced REP MOVSB/STOSB can provide another advantage to amortize the cost of the non-consuming code.
    The heuristic can be understood using a value of Cnt = 4096 and memset() as example:
    • A 256-bit SIMD implementation of memset() will need to issue/execute retire 128 instances of 32-byte store operation with VMOVDQA, before the non-consuming instruction sequences can make their way to retirement.
    • An instance of enhanced REP STOSB with ECX = 4096 is decoded as a long micro-op flow provided by hardware, but retires as one instruction. There are many store_data operation that must complete before the result of memset() can be consumed. Because the completion of store data operation is de-coupled from program-order retirement, a substantial part of the non-consuming code stream can process through the issue/execute and retirement, essentially cost-free if the non-consuming sequence does not compete for store buffer resources.

So Intel is saying that some uops of the code following rep movsb can already have issued, and while lots of stores are still in flight and the rep movsb as a whole hasn't retired yet, uops from the following instructions can make more progress through the out-of-order machinery than they could if that code came after a copy loop.

The uops from an explicit load and store loop all have to actually retire separately, in program order. That has to happen to make room in the ROB for following uops.

There doesn't seem to be much detailed information about exactly how very long microcoded instructions like rep movsb work. We don't know exactly how micro-code branches request a different stream of uops from the microcode sequencer, or how the uops retire. If the individual uops don't have to retire separately, perhaps the whole instruction only takes up one slot in the ROB?

When the front-end that feeds the OoO machinery sees a rep movsb instruction in the uop cache, it activates the Microcode Sequencer ROM (MS-ROM) to send microcode uops into the queue that feeds the issue/rename stage. It's probably not possible for any other uops to mix in with that and issue/execute8 while rep movsb is still issuing, but subsequent instructions can be fetched/decoded and issue right after the last rep movsb uop does, while some of the copy hasn't executed yet. This is only useful if at least some of your subsequent code doesn't depend on the result of the memcpy (which isn't unusual).

Now, the size of this benefit is limited: at most you can execute N instructions (uops actually) beyond the slow rep movsb instruction, at which point you'll stall, where N is the ROB size. With current ROB sizes of ~200 (192 on Haswell, 224 on Skylake), that's a maximum benefit of ~200 cycles of free work for subsequent code with an IPC of 1. In 200 cycles you can copy somewhere around 800 bytes at 10 GB/s, so for copies of that size you may get free work close to the cost of the copy (in a way making the copy free).

As copy sizes get much larger, however, the relative importance of this diminishes rapidly (e.g., if you are copying 80 KB instead, the free work is only 1% of the copy cost). Still, it is quite interesting for modest-sized copies.

Copy loops don't totally block subsequent instructions from executing, either. Intel does not go into detail on the size of the benefit, or on what kind of copies or surrounding code benefits most (hot or cold destination or source, high-ILP or low-ILP high-latency code after).

Code Size

The executed code size (a few bytes) is microscopic compared to a typical optimized memcpy routine.
If performance is at all limited by i-cache (including uop cache) misses, the reduced code size might be of benefit.

Again, we can bound the magnitude of this benefit based on the size of the copy. I won't actually work it out numerically, but the intuition is that reducing the dynamic code size by B bytes can save at most C * B cache misses, for some constant C. Every call to memcpy incurs the cache miss cost (or benefit) once, but the advantage of higher throughput scales with the number of bytes copied. So for large transfers, higher throughput will dominate the cache effects.

Again, this is not something that will show up in a plain benchmark, where the entire loop will no doubt fit in the uop cache. You'll need a real-world, in-place test to evaluate this effect.

Architecture Specific Optimization

You reported that on your hardware, rep movsb was considerably slower than the platform memcpy. However, even here there are reports of the opposite result on earlier hardware (like Ivy Bridge).

That's entirely plausible, since it seems that the string move operations get love periodically - but not every generation - so it may well be faster, or at least tied (at which point it may win based on other advantages), on the architectures where it has been brought up to date, only to fall behind in subsequent hardware.

Quoting Andy Glew, who should know a thing or two about this after implementing these on the P6:

    the big weakness of doing fast strings in microcode was [...] the microcode fell out of tune with every generation, getting slower and slower until somebody got around to fixing it. Just like a library memcpy falls out of tune. I suppose that it is possible that one of the missed opportunities was to use 128-bit loads and stores when they became available, and so on.

In that case, it can be seen as just another "platform specific" optimization to apply in the typical every-trick-in-the-book memcpy routines you find in standard libraries and JIT compilers: but only for use on architectures where it is better. For JIT or AOT-compiled stuff this is easy, but for statically compiled binaries this does require platform-specific dispatch, which often already exists (sometimes implemented at link time), or the mtune argument can be used to make a static decision.
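If you do go the dispatch route, the relevant feature flag is ERMS ("enhanced REP MOVSB/STOSB"), bit 9 of EBX from CPUID leaf 7. Here is a minimal sketch of checking it and selecting an implementation (my own example, assuming GCC/Clang and their <cpuid.h> helper; the function names are made up):

    #include <cpuid.h>
    #include <stddef.h>
    #include <string.h>

    // ERMS is reported in CPUID.(EAX=7, ECX=0):EBX bit 9.
    static int have_erms(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return 0;
        return (ebx >> 9) & 1;
    }

    // Same rep movsb wrapper as in the question.
    static void *copy_rep_movsb(void *d, const void *s, size_t n) {
        asm volatile ("rep movsb"
                      : "=D" (d), "=S" (s), "=c" (n)
                      : "0" (d), "1" (s), "2" (n)
                      : "memory");
        return d;
    }

    typedef void *(*memcpy_fn)(void *, const void *, size_t);

    // Hypothetical resolver: decide once (e.g. at startup or via an ifunc) and
    // call through the pointer afterwards.
    static memcpy_fn pick_memcpy(void) {
        return have_erms() ? copy_rep_movsb : memcpy;
    }

This is essentially what glibc's ifunc-based memcpy selection does at load time, just keyed on more feature bits.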
Simplicity

Even on Skylake, where it seems like it has fallen behind the absolute fastest non-temporal techniques, it is still faster than most approaches and is very simple. This means less time in validation, fewer mystery bugs, less time tuning and updating a monster memcpy implementation (or, conversely, less dependency on the whims of the standard library implementors if you rely on that).

Latency Bound Platforms

Memory throughput bound algorithms9 can actually be operating in two main overall regimes: DRAM bandwidth bound or concurrency/latency bound.

The first mode is the one that you are probably familiar with: the DRAM subsystem has a certain theoretical bandwidth that you can calculate pretty easily based on the number of channels, data rate/width and frequency. For example, my DDR4-2133 system with 2 channels has a max bandwidth of 2.133 * 8 * 2 = 34.1 GB/s, same as reported on ARK.

You won't sustain more than that rate from DRAM (and usually somewhat less due to various inefficiencies), added across all cores on the socket (i.e., it is a global limit for single-socket systems).

The other limit is imposed by how many concurrent requests a core can actually issue to the memory subsystem. Imagine if a core could only have 1 request in progress at once, for a 64-byte cache line - when the request completed, you could issue another. Assume also a very fast 50 ns memory latency. Then despite the large 34.1 GB/s DRAM bandwidth, you'd actually only get 64 bytes / 50 ns = 1.28 GB/s, or less than 4% of the max bandwidth.

In practice, cores can issue more than one request at a time, but not an unlimited number. It is usually understood that there are only 10 line fill buffers per core between the L1 and the rest of the memory hierarchy, and perhaps 16 or so fill buffers between the L2 and DRAM. (By the same arithmetic as above, 10 outstanding lines at 50 ns works out to roughly 10 * 64 bytes / 50 ns ≈ 12.8 GB/s, which is right around the per-core figures quoted below.) Prefetching competes for the same resources, but at least helps reduce the effective latency. For more details look at any of the great posts Dr. Bandwidth has written on the topic, mostly on the Intel forums.

Still, most recent CPUs are limited by this factor, not the RAM bandwidth. Typically they achieve 12 - 20 GB/s per core, while the RAM bandwidth may be 50+ GB/s (on a 4-channel system). Only some recent-gen 2-channel "client" cores, which seem to have a better uncore and perhaps more line fill buffers, can hit the DRAM limit on a single core, and our Skylake chips seem to be one of them.

Now of course, there is a reason Intel designs systems with 50 GB/s DRAM bandwidth while only being able to sustain < 20 GB/s per core due to concurrency limits: the former limit is socket-wide and the latter is per core. So each core on an 8-core system can push 20 GB/s worth of requests, at which point they will be DRAM limited again.

Why am I going on and on about this? Because the best memcpy implementation often depends on which regime you are operating in. Once you are DRAM BW limited (as our chips apparently are, but most aren't on a single core), using non-temporal writes becomes very important since it saves the read-for-ownership that normally wastes 1/3 of your bandwidth. You see that exactly in the test results above: the memcpy implementations that don't use NT stores lose 1/3 of their bandwidth.

If you are concurrency limited, however, the situation equalizes and sometimes reverses. You have DRAM bandwidth to spare, so NT stores don't help, and they can even hurt since they may increase the latency: the handoff time for the line buffer may be longer than a scenario where a prefetch brings the RFO line into LLC (or even L2) and then the store completes in LLC, for an effective lower latency. Finally, server uncores tend to have much slower NT stores than client ones (and high bandwidth), which accentuates this effect.

So on other platforms you might find that NT stores are less useful (at least when you care about single-threaded performance) and perhaps rep movsb wins (if it gets the best of both worlds).

Really, this last item is a call for more testing. I know that NT stores lose their apparent advantage for single-threaded tests on most archs (including current server archs), but I don't know how rep movsb will perform relatively...

References

Other good sources of info not integrated in the above.

comp.arch investigation of rep movsb versus alternatives. Lots of good notes about branch prediction, and an implementation of the approach I've often suggested for small blocks: using overlapping first and/or last reads/writes rather than trying to write only exactly the required number of bytes (for example, implementing all copies from 9 to 16 bytes as two 8-byte copies which might overlap in up to 7 bytes); a sketch of that idea follows below.
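As a hedged illustration of that overlapping trick (my own sketch, not the code from that thread): any length from 9 to 16 bytes can be handled with exactly two 8-byte loads and two 8-byte stores, the second pair simply re-covering part of the first when the length is under 16.

    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    // Copy n bytes where 9 <= n <= 16 with two possibly-overlapping 8-byte
    // moves instead of a byte-granular tail. The fixed-size memcpy calls are
    // just portable unaligned 8-byte loads/stores; compilers lower each one
    // to a single mov.
    static inline void copy_9_to_16(void *dst, const void *src, size_t n) {
        uint64_t head, tail;
        const unsigned char *s = src;
        unsigned char *d = dst;
        memcpy(&head, s, 8);              // bytes [0, 8)
        memcpy(&tail, s + n - 8, 8);      // bytes [n-8, n), overlaps head when n < 16
        memcpy(d, &head, 8);
        memcpy(d + n - 8, &tail, 8);
    }

The same head/tail idea scales up to larger size buckets with 16- or 32-byte vector moves, which is how many library memcpy implementations handle short lengths.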
1 Presumably the intention is to restrict it to cases where, for example, code size is very important.

2 See Section 3.7.5: REP Prefix and Data Movement.

3 It is key to note this applies only to the various stores within the single instruction itself: once complete, the block of stores still appears ordered with respect to prior and subsequent stores. So code can see stores from the rep movs out of order with respect to each other, but not with respect to prior or subsequent stores (and it's the latter guarantee you usually need). It will only be a problem if you use the end of the copy destination as a synchronization flag, instead of a separate store.

4 Note that non-temporal discrete stores also avoid most of the ordering requirements, although in practice rep movs has even more freedom since there are still some ordering constraints on WC/NT stores.

5 This was common in the latter part of the 32-bit era, where many chips had 64-bit data paths (e.g., to support FPUs which had support for the 64-bit double type). Today, "neutered" chips such as the Pentium or Celeron brands have AVX disabled, but presumably rep movs microcode can still use 256b loads/stores.

6 E.g., due to language alignment rules, alignment attributes or operators, aliasing rules or other information determined at compile time. In the case of alignment, even if the exact alignment can't be determined, they may at least be able to hoist alignment checks out of loops or otherwise eliminate redundant checks.

7 I'm making the assumption that "standard" memcpy is choosing a non-temporal approach, which is highly likely for this size of buffer.

8 That isn't necessarily obvious, since it could be the case that the uop stream generated by the rep movsb simply monopolizes dispatch and then it would look very much like the explicit mov case. It seems that it doesn't work like that, however - uops from subsequent instructions can mingle with uops from the microcoded rep movsb.

9 I.e., those which can issue a large number of independent memory requests and hence saturate the available DRAM-to-core bandwidth, of which memcpy would be a poster child (as opposed to purely latency-bound loads such as pointer chasing).

That's all for this post on enhanced REP MOVSB for memcpy - I hope the answer above is helpful.