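Problem description
I am starting to use Halide, and while I have grasped the basic tenets of its design, I am struggling with the particulars (read: magic) required to schedule computations efficiently.
Below I have posted an MWE that uses Halide to copy an array from one location to another. I had assumed this would compile down to only a handful of instructions and take less than a microsecond to run. Instead, it produces 4000 lines of assembly and takes 40 ms to run! Clearly, I have a significant hole in my understanding.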
- What is the canonical way of wrapping an existing array in a `Halide::Image`?
- How should the function `copy` be scheduled?
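#include <Halide.h>
using namespace Halide;

// Copy a 2-D array through a Halide pipeline; N is the x extent, M the y extent.
void _copy(uint8_t* in_ptr, uint8_t* out_ptr, const int M, const int N) {
    // Wrap the raw pointers as 2-D images of 8-bit unsigned pixels.
    Image<uint8_t> in(Buffer(UInt(8), N, M, 0, 0, in_ptr));
    Image<uint8_t> out(Buffer(UInt(8), N, M, 0, 0, out_ptr));
    Var x, y;
    Func copy;
    copy(x, y) = in(x, y);  // pointwise definition: output pixel equals input pixel
    copy.realize(out);      // compile and run the pipeline, writing into out
}

int main(void) {
    uint8_t in[10000], out[10000];
    _copy(in, out, 100, 100);
}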
Compile flags
clang++ -O3 -march=native -std=c++11 -Iinclude -Lbin -lHalide copy.cpp
Solution
Let me start with your second question. `_copy` takes a long time because it needs to compile the Halide code to x86 machine code. IIRC, a `Func` caches its machine code, but since `copy` is local to `_copy`, that cache cannot be reused.
In any case, scheduling `copy` is simple because it is a pointwise operation. First, it probably makes sense to vectorize it. Second, it might make sense to parallelize it, depending on how much data there is. For example, `copy.vectorize(x, 32).parallel(y);` will vectorize along `x` with a vector size of 32 and parallelize along `y`. (I am writing this from memory, so there might be some confusion about the correct names.) Of course, doing all this might also increase compile times.
There is no recipe for good scheduling. I do it by looking at the output of `compile_to_lowered_stmt` and by profiling the code. I also use the AOT compilation provided by `Halide::Generator`, which makes sure that I measure only the run time of the code and not the compile time.
Your other question was how to wrap an existing array in a `Halide::Image`. I don't do that, mostly because I use AOT compilation. However, internally Halide uses a type called `buffer_t` for everything image related. There is also a C++ wrapper called `Halide::Buffer` that makes using `buffer_t` a little easier; I think it can also be used in `Func::realize` instead of `Halide::Image`. The point is: if you understand `buffer_t`, you can wrap almost anything into something digestible by Halide.
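To make the vectorize-and-parallelize suggestion a bit more concrete, here is a rough sketch in plain C++ of the loop structure that vectorizing along `x` by 32 and parallelizing along `y` describes. It does not use Halide. The names `copy_rows`, `kVectorWidth`, and `copy_demo_matches` are hypothetical. The outer loop runs serially here, where a parallel schedule would hand rows to worker threads. The scalar tail loop is this sketch's own choice, not a claim about the code Halide generates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only: hand-written C++, not Halide output.
constexpr int kVectorWidth = 32;  // assumed vector size, as in the answer's example

// Copies a width x height image stored row-major (x varies fastest).
void copy_rows(const uint8_t* in, uint8_t* out, int width, int height) {
    assert(width >= 0 && height >= 0);
    // Outer loop over y: the loop a parallel schedule would split across threads.
    for (int y = 0; y < height; ++y) {
        const uint8_t* src = in + static_cast<std::size_t>(y) * width;
        uint8_t* dst = out + static_cast<std::size_t>(y) * width;

        // Full chunks of kVectorWidth pixels. The inner loop has a fixed trip
        // count, which is the shape a compiler can map to vector loads/stores.
        int x = 0;
        for (; x + kVectorWidth <= width; x += kVectorWidth) {
            for (int lane = 0; lane < kVectorWidth; ++lane) {
                dst[x + lane] = src[x + lane];
            }
        }
        // Scalar tail for widths that are not a multiple of kVectorWidth.
        for (; x < width; ++x) {
            dst[x] = src[x];
        }
    }
}

// Small usage example: copy a 100 x 100 image and compare the result.
bool copy_demo_matches() {
    const int width = 100, height = 100;
    std::vector<uint8_t> in(width * height), out(width * height, 0);
    for (std::size_t i = 0; i < in.size(); ++i) {
        in[i] = static_cast<uint8_t>(i % 251);
    }
    copy_rows(in.data(), out.data(), width, height);
    return in == out;
}
```

With a 100-pixel row, each row is handled as three full chunks of 32 plus a 4-pixel tail. Whether such a split pays off for a plain copy is exactly the kind of thing profiling should decide.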
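The remark about `buffer_t` can be illustrated with a simplified stand-in. The struct `RawImageView` below is hypothetical and is not Halide's actual `buffer_t` definition. It only keeps the ideas that seem to matter for wrapping an existing array: a host pointer plus a per-dimension extent, stride, and minimum coordinate, with no copy of the pixel data. The helper names `wrap_row_major`, `at`, and `demo_wrap_roundtrip` are also made up for this sketch. Consult the Halide headers for the real fields.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for an image descriptor.
// NOT Halide's real buffer_t; the layout is chosen for this sketch only.
struct RawImageView {
    uint8_t* host;      // existing pixel data (not owned, not copied)
    int32_t extent[2];  // size in x and y
    int32_t stride[2];  // distance in elements between neighbours in x and y
    int32_t min[2];     // coordinate of the first element in x and y
};

// Describe an existing dense row-major array with N columns (x) and M rows (y).
RawImageView wrap_row_major(uint8_t* data, int32_t N, int32_t M) {
    assert(data != nullptr && N > 0 && M > 0);
    return RawImageView{data, {N, M}, {1, N}, {0, 0}};
}

// Reference to pixel (x, y) inside the wrapped array.
uint8_t& at(const RawImageView& v, int32_t x, int32_t y) {
    assert(x >= v.min[0] && x < v.min[0] + v.extent[0]);
    assert(y >= v.min[1] && y < v.min[1] + v.extent[1]);
    return v.host[(x - v.min[0]) * v.stride[0] + (y - v.min[1]) * v.stride[1]];
}

// Usage example: wrap two plain arrays and copy one into the other via the views.
bool demo_wrap_roundtrip() {
    static uint8_t in[10000], out[10000];  // static storage, so zero-initialized
    RawImageView src = wrap_row_major(in, 100, 100);
    RawImageView dst = wrap_row_major(out, 100, 100);
    at(src, 3, 7) = 42;  // writes in[7 * 100 + 3]
    for (int32_t y = 0; y < 100; ++y) {
        for (int32_t x = 0; x < 100; ++x) {
            at(dst, x, y) = at(src, x, y);
        }
    }
    return in[703] == 42 && out[703] == 42;
}
```

Because the view stores only a pointer and layout metadata, writes through it land directly in the caller's array, which is the behaviour the question asks for when moving data between a C++ array and a Halide image.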