如您所见,sqrt 是在 16 x (BitShiftLeft,BitShiftRight,OR,Plus,Compare) 中完成的,其中位移和 LUT 可以是硬接线.如果与其他门相比,您有超快的门,您可以将输入时钟乘以 16 并将其用作 SQRT 模块的内部时序.类似于过去在旧的 Intel CPU/MCU s 中有 MC 时钟作为源 CPU 时钟的分工的东西......这种方式您可以获得 1T 计时(或它的倍数取决于乘法比).I want to design a synthesizable module in Verilog which will take only one cycle in calculating square root of given input of 32 bit. 解决方案 [Edit1] repaired codeRecently found the results where off even if tests determine all was OK so I dig deeper and found out that I had a silly bug in my equation and due to name conflicts with my pgm environment the tests got false positives so I overlooked it before. Now it work in all cases as it should.The best thing I can think of (except approximation or large LUT) is binary search without multiplication, here C++ code://---------------------------------------------------------------------------WORD u32_sqrt(DWORD xx) // 16 T { DWORD x,m,a0,a1,i; const DWORD lut[16]= { // m*m 0x40000000, 0x10000000, 0x04000000, 0x01000000, 0x00400000, 0x00100000, 0x00040000, 0x00010000, 0x00004000, 0x00001000, 0x00000400, 0x00000100, 0x00000040, 0x00000010, 0x00000004, 0x00000001, }; for (x=0,a0=0,m=0x8000,i=0;m;m>>=1,i++) { a1=a0+lut[i]+(x<<(16-i)); if (a1<=xx) { a0=a1; x|=m; } } return x; }//---------------------------------------------------------------------------Standard binary search sqrt(xx) is setting bits of x from MSB to LSB so that result of x*x <= xx. Luckily we can avoid the multiplication by simply rewrite the thing as incrementing multiplicant... in each iteration the older x*x result can be used like this:x1 = x0+mx1*x1 = (x0+m)*(x0+m) = (x0*x0) + (2*m*x0) + (m*m)Where x0 is value of x from last iteration and x1 is actual value. The m is weight of actual processed bit. The (2*m) and (m*m) are constant and can be used as LUT and bit-shift so no need to multiply. Only addition is needed. Sadly the iteration is bound to sequential computation forbid paralelisation so the result is 16T at best.In the code a0 represents last x*x and a1 represents actual iterated x*xAs you can see the sqrt is done in 16 x (BitShiftLeft,BitShiftRight,OR,Plus,Compare) where the bit shift and LUT can be hardwired.If you got super fast gates for this in comparison to the rest you can multiply the input clock by 16 and use that as internal timing for SQRT module. Something similar to the old days when there was MC clock as Division of source CPU clock in old Intel CPU/MCUs ... This way you can get 1T timing (or multiple of it depends on the multiplication ratio). 这篇关于如何仅在一个时钟周期内获得 32 位输入的平方根?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持! 上岸,阿里云!