C++ array to Halide Image (and back)

Question

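I'm getting started with Halide, and while I've grasped the basic principles of its design, I'm struggling with the particulars (read: magic) required to schedule computations efficiently.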

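Below is an MWE that uses Halide to copy an array from one location to another. I had assumed this would compile down to only a handful of instructions and take less than a microsecond to run. Instead, it produces 4000 lines of assembly and takes 40 ms to run! Clearly, I have a significant hole in my understanding.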

What is the canonical way to wrap an existing array in a Halide::Image?

How should the function copy be scheduled?

    #include <Halide.h>

    using namespace Halide;

    void _copy(uint8_t* in_ptr, uint8_t* out_ptr, const int M, const int N) {
        Image<uint8_t> in(Buffer(UInt(8), N, M, 0, 0, in_ptr));
        Image<uint8_t> out(Buffer(UInt(8), N, M, 0, 0, out_ptr));
        Var x, y;
        Func copy;
        copy(x, y) = in(x, y);
        copy.realize(out);
    }

    int main(void) {
        uint8_t in[10000], out[10000];
        _copy(in, out, 100, 100);
    }

Compile flags

    clang++ -O3 -march=native -std=c++11 -Iinclude -Lbin -lHalide copy.cpp

Answer
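Let me start with your second question: _copy takes a long time because it has to compile the Halide code to x86 machine code. IIRC, Func caches the machine code, but since copy is local to _copy, that cache cannot be reused. In any case, scheduling copy is simple because it is a pointwise operation. First, it probably makes sense to vectorize it. Second, it may make sense to parallelize it (depending on how much data there is). For example:

    copy.vectorize(x, 32).parallel(y);

will vectorize along x with a vector size of 32 and parallelize along y. (I am making this up from memory, so there might be some confusion about the correct names.) Of course, doing all this might also increase compile times...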
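To make the effect of such a schedule more concrete, here is a rough plain C++ sketch of the loop structure it asks for. This is not what Halide generates. The names copy_row, copy_image and kVec are made up for illustration, and the outer loop is left serial to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical chunk width, standing in for "a vector size of 32".
constexpr int kVec = 32;

// Hypothetical helper: copy one row in chunks of kVec pixels.
void copy_row(const std::uint8_t* in_row, std::uint8_t* out_row, int width) {
    int x = 0;
    // Full chunks: the fixed trip count of 32 is easy for a compiler to turn
    // into SIMD loads and stores.
    for (; x + kVec <= width; x += kVec) {
        for (int i = 0; i < kVec; ++i) {
            out_row[x + i] = in_row[x + i];
        }
    }
    // Tail: the remaining width % 32 pixels, one at a time.
    for (; x < width; ++x) {
        out_row[x] = in_row[x];
    }
}

// Hypothetical helper: rows are independent of each other, so this outer loop
// is the one that "parallelize along y" would hand out to threads.
// It is kept serial in this sketch.
void copy_image(const std::uint8_t* in, std::uint8_t* out, int width, int height) {
    assert(width >= 0 && height >= 0);
    for (int y = 0; y < height; ++y) {
        const std::size_t offset = static_cast<std::size_t>(y) * width;
        copy_row(in + offset, out + offset, width);
    }
}
```

For the 100 x 100 image from the question, copy_image(in, out, 100, 100) processes three full chunks plus a 4-pixel tail per row. Whether splitting the rows across threads pays off for so little data is something only profiling can tell.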
There is no recipe for good scheduling. I do it by looking at the output of compile_to_lowered_stmt and by profiling the code. I also use the AOT compilation provided by Halide::Generator, which makes sure that I only measure the runtime of the code and not the compile time.
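If you stay with JIT compilation, a similar separation can be approximated by timing the first call apart from the later ones. The helper below is a generic sketch and not part of Halide. Timing and time_first_and_steady are hypothetical names, and it assumes that any one-time cost (such as compilation) is paid on the first call only.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical result type: the first call (which may include one-time setup
// such as JIT compilation) and the average of the calls after it.
struct Timing {
    double first_call_us;
    double steady_state_us;
};

// Hypothetical helper: run fn once, then `repeats` more times, and report
// both durations in microseconds.
Timing time_first_and_steady(const std::function<void()>& fn, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    auto to_us = [](clock::duration d) {
        return std::chrono::duration<double, std::micro>(d).count();
    };

    const auto t0 = clock::now();
    fn();  // first call: may pay one-time costs
    const auto t1 = clock::now();
    for (int i = 0; i < repeats; ++i) {
        fn();  // later calls: steady-state cost only
    }
    const auto t2 = clock::now();

    return {to_us(t1 - t0), to_us(t2 - t1) / repeats};
}
```

Note that this only helps if the pipeline object outlives the individual calls. With copy declared inside _copy, as in the question, every call builds a new Func and pays the compile cost again.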
Your other question was how to wrap an existing array in a Halide::Image. I don't do that, mostly because I use AOT compilation. Internally, however, Halide uses a type called buffer_t for everything image-related. There is also a C++ wrapper called Halide::Buffer that makes using buffer_t a little easier; I think it can also be used in Func::realize instead of Halide::Image. The point is: if you understand buffer_t, you can wrap almost anything into something digestible by Halide.
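The idea behind such a descriptor can be sketched in plain C++: a non-owning pointer plus the extent, stride and minimum coordinate of each dimension. The struct below is a simplified, hypothetical stand-in (RawImage2D, wrap_row_major and at are made-up names). It is not Halide's actual buffer_t definition, so check the Halide headers for the real fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified descriptor for a 2-D image stored in memory that
// someone else owns. It mirrors the idea of a buffer descriptor only.
struct RawImage2D {
    std::uint8_t* host;  // pointer to the first element; not owned
    int extent[2];       // number of elements along x and y
    int stride[2];       // distance, in elements, between neighbours along x and y
    int min[2];          // coordinate of the first element along x and y
};

// Wrap a dense row-major array with `height` rows of `width` elements each.
RawImage2D wrap_row_major(std::uint8_t* data, int width, int height) {
    assert(data != nullptr && width > 0 && height > 0);
    return RawImage2D{data, {width, height}, {1, width}, {0, 0}};
}

// Element (x, y): shift each coordinate by min, then scale by stride.
std::uint8_t& at(const RawImage2D& img, int x, int y) {
    assert(x >= img.min[0] && x < img.min[0] + img.extent[0]);
    assert(y >= img.min[1] && y < img.min[1] + img.extent[1]);
    const std::ptrdiff_t offset =
        static_cast<std::ptrdiff_t>(x - img.min[0]) * img.stride[0] +
        static_cast<std::ptrdiff_t>(y - img.min[1]) * img.stride[1];
    return img.host[offset];
}
```

Because the descriptor only points at the data, wrapping copies nothing: writing through at(img, x, y) changes the original array. For the arrays in the question, wrap_row_major(in, 100, 100) would describe a 100 x 100 image with an x stride of 1 and a y stride of 100.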
