问题描述
我要重写这种简单的程序来SSE2 code,(preferably
在NASM),我不能完全确定该怎么做,有两件事
尚不清楚(怎么EX preSS计算(内环和来自
外环太)以及如何调用C code函数SetPixelInDibInt(I,J,调色板[N]);
从下staticaly联ASM code
I want to rewrite such simple routine to SSE2 code, (preferablyin nasm) and I am not totally sure how to do it, two thingsnot clear (how to express calculations (inner loop and those fromouter loop too) and how to call c code function "SetPixelInDibInt(i ,j, palette[n]);"from under staticaly linked asm code
void DrawMandelbrotD(double ox, double oy, double lx, int N_ITER)
{
double ly = lx * double(CLIENT_Y)/double(CLIENT_X);
double dx = lx / CLIENT_X;
double dy = ly / CLIENT_Y;
double ax = ox - lx * 0.5 + dx * 0.5;
double ay = oy - ly * 0.5 + dy * 0.5;
static double re, im, re_n, im_n, c_re, c_im, rere, imim, int n;
for(int j=0; j<CLIENT_Y; j+=1)
{
for(int i=0; i<CLIENT_X; i+=1)
{
c_re = ax + i * dx;
c_im = ay + j * dy;
re = c_re;
im = c_im;
rere=re*re;
imim=im*im;
n=1;
for(int k=0;k<N_ITER;k++)
{
im = (re+re)*im + c_im;
re = rere - imim + c_re;
rere=re*re;
imim=im*im;
if ( (rere + imim) > 4.0 ) break;
n++;
}
SetPixelInDibInt(i ,j, palette[n]);
}
}
}
可能有人帮助,我想不看其他code
实现,但高于刚NASM-SSE翻译
- 这将是最有益的在我的情况? - 可能有人提供帮助的
could someone help, I would like not to see other codeimplementations but just nasm-sse translation of those above- it would be most helpfull in my case - could someone help with that?
推荐答案
英特尔拥有一套完整的实现作为一个AVX例子。见下文。
Intel has a complete implementation as an AVX example. See below.
什么使曼德尔布罗棘手的是,对于在该组(即,像素)的每个点的早期出条件是不同的。你可以保持一个对或迭代像素的四直到两个的幅度超过2.0(或者你打的最大迭代)。如果不这样做就需要跟踪该像素的点是在向量元素。
What makes Mandelbrot tricky is that the early-out condition for each point in the set (i.e. pixel) is different. You could keep a pair or quad of pixels iterating until the magnitude of both exceeds 2.0 (or you hit max iterations). To do otherwise would require tracking which pixel's points were in which vector element.
反正简单化实现对一个向量操作的2(或4 AVX)在一个时间加倍将具有其可以通过由依赖性链的延迟限制。你需要做多个依存关系链并行,以保持双方的Haswell喂养的FMA单位。所以,你会复制你的变量,并交织业务为内环内外部循环的两个迭代。
Anyway, a simplistic implementation to operate on a vector of 2 (or 4 with AVX) doubles at a time would have its throughput limited by the latency of the dependency chains. You'd need to do multiple dependency chains in parallel to keep both of Haswell's FMA units fed. So you'd duplicate your variables, and interleave operations for two iterations of the outer loop inside the inner loop.
饲养轨道。我想,这可能需要开销少用一组寄存器用于一行像素,并为另一行另一组寄存器。 (所以你可以永远只是移动4像素的权利,而不是检查其他DEP链是否已经处理了向量。)
Keeping track of which pixels are being calculated would be a little tricky. I think it might take less overhead to use one set of registers for one row of pixels, and another set of registers for another row. (So you can always just move 4 pixels to the right, rather than checking whether the other dep chain is already processing that vector.)
我怀疑只有检查循环退出条件,每4次迭代左右可能是一个双赢。获得code支线基于在一个拥挤的向量比较,略比标量的情况下更昂贵。所需的额外附加FP还贵。 (Haswell的可以做到每个周期通过两个FMAS,(延迟= 5)。孤独的FP加单位是相同的端口作为FMA单位之一。这两个FP MUL单位上可以运行FMA相同的端口。)
I suspect that only checking the loop exit condition every 4 iterations or so might be a win. Getting code to branch based on a packed vector comparison, is slightly more expensive than in the scalar case. The extra FP add required is also expensive. (Haswell can do two FMAs per cycle, (latency = 5). The lone FP add unit is one the same port as one of the FMA units. The two FP mul units are on the same ports that can run FMA.)
循环条件可以与被检查的填料 - 比较生成的零和一的掩模,和一个(Ⅴ)PTEST
的,与自身,以查看是否登记这一切都为零。 (编辑: movmskps
然后测试+ JCC
是更少的微指令,但也许更高的延迟。)那么显然 JE
或 JNE
吨。 NAN不应该是可能的,但没有理由不选择你的比较运算,使得NAN将导致退出条件是真实的。
The loop condition can be checked with a packed-compare to generate a mask of zeros and ones, and a (V)PTEST
of that register with itself to see if it's all zero. (edit: movmskps
then test+jcc
is fewer uops, but maybe higher latency.) Then obviously je
or jne
as appropriate, depending on whether you did a FP compare that leaves zeros when you should exit, or zeros when you shouldn't. NAN shouldn't be possible, but there's no reason not to choose your comparison op such that a NAN will result in the exit condition being true.
const __mm256d const_four = _mm256_set1_pd(4.0); // outside the loop
__m256i cmp_result = _mm256_cmp_pd(mag_squared, const_four, _CMP_LE_OQ); // vcmppd. result is non-zero if at least one element < 4.0
if (_mm256_testz_si256(cmp_result, cmp_result))
break;
有可能是某种方式使用 PTEST
直接在包装双,一些位黑客和面罩,将挑选出将被设置IFF位FP值> 4.0。就像在指数或许有些位?也许值得考虑。我发现了一个这件事,但没有尝试它。
There MIGHT be some way to use PTEST
directly on a packed-double, with some bit-hack AND-mask that will pick out bits that will be set iff the FP value is > 4.0. Like maybe some bits in the exponent? Maybe worth considering. I found a forum post about it, but didn't try it out.
嗯,哦,废话,这不,当循环条件不合格,每个向量元素分开,着色Mandelbrot集外点的目的进行录制。也许测试的任何的元素撞击的条件(而不是全部),记录结果,然后设置元素(和 C
该元素)到0.0,所以它不会再次触发退出条件。或者,也许调度像素为矢量元素毕竟是要走的路。这code可以做相当不错超线程CPU上,因为会有很多分支误的predicts与每一个元素分别触发早期出的条件。
Hmm, oh crap, this doesn't record WHEN the loop condition failed, for each vector element separately, for the purpose of coloring the points outside the Mandelbrot set. Maybe test for any element hitting the condition (instead of all), record the result, and then set that element (and c
for that element) to 0.0 so it won't trigger the exit condition again. Or maybe scheduling pixels into vector elements is the way to go after all. This code might do fairly well on a hyperthreaded CPU, since there will be a lot of branch mispredicts with every element separately triggering the early-out condition.
这可能会浪费了很多的吞吐量,并考虑到每个周期有4个微指令是可行的,但其中只有2可以是FP MUL /添加/ FMA,有空间为整数code的显著量安排点进去矢量元素。 (SandyBridge的上/ Ivybrideg,没有FMA,FP吞吐量较低,但只有3个端口,可用于处理整数OPS,和那些2为FP MUL和FP加上单位的端口。)
That might waste a lot of your throughput, and given that 4 uops per cycle is doable, but only 2 of them can be FP mul/add/FMA, there's room for a significant amount of integer code to schedule points into vector elements. (On Sandybridge/Ivybrideg, without FMA, FP throughput is lower. But there are only 3 ports that can handle integer ops, and 2 of those are the ports for the FP mul and FP add units.)
既然你没有读取任何数据源,这里只有每个DEP链1内存访问流,这是一个写流。 (和它的低带宽,因为你是准备写一个像素值之前最高分走了很多的迭代。)所以硬件prefetch流的数量不是DEP链的数量的限制因素并行运行。高速缓存未命中的延迟应该写缓冲区被隐藏。
Since you don't have to read any source data, there's only 1 memory access stream for each dep chain, and it's a write stream. (And it's low bandwidth, since most points take a lot of iterations before you're ready to write a single pixel value.) So the number of hardware prefetch streams isn't a limiting factor for the number of dep chains to run in parallel. Cache misses latency should be hidden by write buffers.
如果有人仍然有兴趣在此我可以写一些code(只是发表评论)。我停在了高层次的设计阶段,因为这是一个老问题,但。
I can write some code if anyone's still interested in this (just post a comment). I stopped at the high-level design stage since this is an old question, though.
我还发现,英特尔已经使用Mandelbrot集作为一个例子为自己的的。他们使用的循环条件面具断矢量元素的方法。 (使用 vcmpps
直接生成面膜AND)。他们的研究结果表明,AVX(单precision)超过了标量浮动一个7倍的提速,所以显然它不是相邻像素在非常不同的数字迭代打早出的条件普遍。 (至少他们与测试的缩放/平移。)
I also found that Intel already used the Mandelbrot set as an example for one of their AVX tutorials. They use the mask-off-vector-elements method for the loop condition. (using the mask generated directly by vcmpps
to AND). Their results indicate that AVX (single-precision) gave a 7x speedup over scalar float, so apparently it's not common for neighbouring pixels to hit the early-out condition at very different numbers of iterations. (at least for the zoom / pan they tested with.)
他们只是让FP结果不断积累的失败早出状态的元素。他们只是停止增加该元素的计数器。希望大多数系统默认为具有控制字设定为零了非正规,非正规数是否仍然需要额外的周期。
They just let the FP results keep accumulating for elements that fail the early-out condition. They just stop incrementing the counter for that element. Hopefully most systems default to having the control word set to zero out denormals, if denormals still take extra cycles.
他们的code是一个愚蠢的方式,虽然:他们跟踪了浮点矢量每个矢量元素的迭代次数,然后将其转换为使用前年底为int。这将会是更快,而不是占据FP执行单元,使用包装-整数为。哦,我知道他们为什么这样做:AVX(不含AVX2)不支持256bit的整数向量欢声笑语。他们可以使用16位包装INT循环计数器,但可能溢出。 (他们不得不COM preSS面具低于256B 128B到)。
Their code is silly in one way, though: They track the iteration count for each vector element with a floating-point vector, and then convert it to int at the end before use. It'd be faster, and not occupy an FP execution unit, to use packed-integers for that. Oh, I know why they do that: AVX (without AVX2) doesn't support 256bit integer vector ops. They could have used packed 16bit int loop counters, but that could overflow. (And they'd have to compress the mask down from 256b to 128b).
他们还测试是所有元素> 4.0 movmskps
键,然后测试,而不是使用 PTEST
。我猜测试/ JCC
可以宏观保险丝,比FP向量OPS不同的执行单元上运行,所以它也许不是更慢。哦,当然,AVX(不包括AVX2)不具有256bit的 PTEST
。此外, PTEST
2微指令,所以实际上 movmskps
+ 测试/ JCC
比 PTEST + JCC 。 ( PTEST
是SNB 1融合域UOP公司,但对于执行端口还是2非融合微指令。在IVB / HSW,2微指令甚至在融合领域。)所以这看起来像 movmskps
是最佳的方式,除非你可以采取按位优势,这就是 PTEST
的一部分,或者需要测试不仅仅是各元素的高比特以上。如果一个分支是联合国predictable, PTEST
可能是更低的延迟,因此是值得的,通过捕捉错误predicts周期更快。
They also test for all elements being > 4.0 with movmskps
and then test that, instead of using ptest
. I guess the test / jcc
can macro-fuse, and run on a different execution unit than FP vector ops, so it's maybe not even slower. Oh, and of course AVX (without AVX2) doesn't have 256bit PTEST
. Also, PTEST
is 2 uops, so actually movmskps
+ test / jcc
is fewer uops than ptest + jcc
. (PTEST
is 1 fused-domain uop on SnB, but still 2 unfused uops for the execution ports. On IvB/HSW, 2 uops even in the fused domain.) So it looks like movmskps
is the optimal way, unless you can take advantage of the bitwise AND that's part of PTEST
, or need to test more than just the high bit of each element. If a branch is unpredictable, ptest
might be lower latency, and thus be worth it by catching mispredicts a cycle sooner.
这篇关于从C到SSE2一些曼德勃罗绘图程序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!