The purpose of the next two code sections is to print a number in binary. The first one does this with two instructions (_bittest), while the second does it with pure arithmetic instructions, which is three instructions.

The first code section:

```c
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>

long num = 78002;

int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = _bittest(&num, nBit);
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}
```

The inner loop compiles to the instructions bt and setb.

The second code section:

```c
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>

long num = 78002;

int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        long curBit = 1;
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = (num & curBit) >> nBit;
            curBit <<= 1;
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}
```

The inner loop compiles to and, add (as shift left) and sar.

The second code section runs about three times faster than the first one. Why do three CPU instructions run faster than two?
Solution

I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code). _bittest is slower because MSVC does a horrible job and keeps the value in memory, and bt [mem], reg is much slower than bt reg, reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still a constant!

I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say which CPU you have, and yes, it matters: bt [mem], reg has one-per-3-cycle throughput on Ryzen, one-per-5-cycle throughput on Haswell, and other perf characteristics differ, too.

(For just looking at the asm, it's usually a good idea to make a function with args to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know if anything modifies num before main runs, because it's not static.)

Your instruction-counting didn't include the whole loop, so your counts are wrong. More importantly, you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)

This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell / Skylake:

```c
for (nBit = 0; nBit < 31; nBit++) {
    bits[nBit] = _bittest(&num, nBit);
    //bits[nBit] = (bool)(num & (1UL << nBit));  // much more efficient
}
```

Asm output from MSVC CL19 -Ox on the Godbolt compiler explorer:

```asm
$LL7@main:
    bt       DWORD PTR num, ebx      ; 10 uops (microcoded), one per 5 cycle throughput
    lea      rcx, QWORD PTR [rcx+1]  ; 1 uop
    setb     al                      ; 1 uop
    inc      ebx                     ; 1 uop
    mov      BYTE PTR [rcx-1], al    ; 1 uop (micro-fused store-address and store-data)
    cmp      ebx, 31
    jb       SHORT $LL7@main         ; 1 uop (macro-fused with cmp)
```

That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.

IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as the bt, or the data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?

Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. With a 1-uop, 1c-latency, 2-per-clock-throughput bt r9d, ebx, this loop would have been faster than your other loop.

"The inner loop compiles to and, add (as shift left) and sar."

Huh? Those are the instructions MSVC associates with the curBit <<= 1; source line (even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line).

But the whole loop is this clunky mess:

```c
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++) {
    bits[nBit] = (num & curBit) >> nBit;
    curBit <<= 1;
}
```

```asm
$LL18@main:                          ; MSVC CL19 -Ox
    mov      ecx, ebx                ; 1 uop
    lea      r8, QWORD PTR [r8+1]    ; 1 uop  pointer-increment for bits
    mov      eax, r9d                ; 1 uop. r9d holds num
    inc      ebx                     ; 1 uop
    and      eax, edx                ; 1 uop
    ; MSVC says all the rest of these instructions are from curBit <<= 1;
    ; but they're obviously not.
    add      edx, edx                ; 1 uop
    sar      eax, cl                 ; 3 uops (variable-count shifts suck)
    mov      BYTE PTR [r8-1], al     ; 1 uop (micro-fused)
    cmp      ebx, 31
    jb       SHORT $LL18@main        ; 1 uop (macro-fused with cmp)
```

So this is 11 fused-domain uops, and takes 2.75 clock cycles per iteration to issue from the front-end. I don't see any loop-carried dep chain longer than that front-end bottleneck, so it probably runs about that fast.

Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift count has to be in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that).

There are major missed optimizations here (in the "fast" version), so you should probably write your source differently to hand-hold your compiler into making less bad code. It implements the source fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg, reg / setc.
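As a concrete example of writing the source differently (my addition, not part of the original answer), here is a minimal compilable sketch of the inner loop rewritten as shift-then-mask, in the same spirit as the (bool)(num & (1UL << nBit)) comment earlier. Compilers can implement this pattern with one variable-count shift plus an and, or with bt reg, reg / setc, avoiding the clunky and/sar sequence entirely:

```c
#include <stdio.h>

long num = 78002;

int main(void)
{
    unsigned char bits[32];
    long nBit;

    for (nBit = 0; nBit < 31; nBit++)
        bits[nBit] = (num >> nBit) & 1;   /* isolate bit nBit directly: no variable-count
                                             sar on the result, no curBit dep chain */

    for (nBit = 30; nBit >= 0; nBit--)    /* print the highest stored bit first,
                                             like the question's code does */
        printf("%d", bits[nBit]);
    printf("\n");
    return 0;
}
```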
How to do this fast in asm or with intrinsics

Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit set for that element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 from ASCII '0' (i.e., add 0 or 1), conditionally turning it into '1'. (A compilable sketch of this approach follows the scalar loop at the end of this answer.)

The first step of this (getting a vector of 0/-1 from a bitmask) is covered by How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.

Fastest way to unpack 32 bits to a 32 byte SIMD vector has a 128b version. Without SSSE3 (pshufb), I think punpcklbw / punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.

See also: is there an inverse instruction to the movemask instruction in intel avx2?

In scalar code, this is one way that runs at 1 bit->byte per clock. There are probably ways to do better without SSE2 (storing multiple bytes at once to get around the 1-store-per-clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.

```asm
    mov      eax, [num]
    lea      rdi, [rsp + xxx]      ; bits[]
.loop:
    shr      eax, 1                ; constant-count shift is efficient (1 uop). CF = last bit shifted out
    setc     [rdi]                 ; 2 uops, but just as efficient as setc reg / mov [mem], reg
    shr      eax, 1
    setc     [rdi+1]
    add      rdi, 2
    cmp      rdi, end_pointer      ; compare against another register instead of a separate counter
    jb       .loop
```

Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.
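To make the SSE2 recipe concrete, here is a minimal compilable sketch of the steps described above (my construction, not code from the original answer; the variable names and the 16-bits-at-a-time framing are my choices). It uses the punpcklbw / punpcklwd / punpckldq byte-repeat, PANDN, PCMPEQB against zero, and the set1('0') subtraction:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    long num = 78002;
    unsigned m = (unsigned)num & 0xFFFF;   /* low 16 bits -> one 16-byte vector */

    /* Repeat each byte of m 8 times, so byte i holds the byte of m that
       contains bit i (punpcklbw / punpcklwd / punpckldq). */
    __m128i v = _mm_cvtsi32_si128((int)m);
    v = _mm_unpacklo_epi8(v, v);
    v = _mm_unpacklo_epi16(v, v);
    v = _mm_unpacklo_epi32(v, v);

    /* One selected bit per byte element: 1, 2, 4, ..., 0x80, repeated. */
    const __m128i bitsel = _mm_set_epi8(
        (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1,
        (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1);

    __m128i inv   = _mm_andnot_si128(v, bitsel);              /* PANDN: byte becomes 0 where the bit is set */
    __m128i mask  = _mm_cmpeq_epi8(inv, _mm_setzero_si128()); /* PCMPEQB vs 0: -1 where set, 0 where clear */
    __m128i ascii = _mm_sub_epi8(_mm_set1_epi8('0'), mask);   /* '0' - (-1) = '1',  '0' - 0 = '0' */

    char buf[17];
    _mm_storeu_si128((__m128i *)buf, ascii);
    buf[16] = '\0';
    printf("%s\n", buf);   /* prints bit 0 first (LSB-first order) */
    return 0;
}
```

For all 32 bits you would build a second vector from the high half the same way (or use the full 32-bit unpack sequence from the linked answers).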