CUDA 2D convolution with a small kernel: how can it be done fast?

Problem Description

I've been experimenting with CUDA kernels for days to perform a fast 2D convolution between a 500x500 image (but I could also vary the dimensions) and a very small 2D kernel (a Laplacian 2D kernel, so it's a 3x3 kernel... too small to take huge advantage of all the CUDA threads).

I created a classic CPU implementation (two for loops, as easy as you would imagine) and then I started creating CUDA kernels.
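For reference, the classic CPU version looks something like the sketch below. This is my illustration, not the asker's code: the function name and the clamp-to-edge border policy are assumptions, since the question does not show either.

#include <algorithm>

// Classic CPU 2D convolution: two loops over the image, plus the small
// 3x3 kernel loops. Border reads are clamped to the edge (an assumption).
void conv2d_cpu(const float* in, float* out, int w, int h, const float k[3][3])
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int ky = -1; ky <= 1; ++ky)
                for (int kx = -1; kx <= 1; ++kx) {
                    int sx = std::min(std::max(x + kx, 0), w - 1);
                    int sy = std::min(std::max(y + ky, 0), h - 1);
                    sum += k[ky + 1][kx + 1] * in[sy * w + sx];
                }
            out[y * w + x] = sum;
        }
}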

After a few disappointing attempts to perform a faster convolution, I ended up with this code: http://www.evl.uic.edu/sjames/cs525/final.html (see the Shared Memory section). It basically lets a 16x16 thread block load all the convolution data it needs into shared memory and then perform the convolution.

Nothing: the CPU is still a lot faster. I didn't try the FFT approach because the CUDA SDK states that it is efficient for large kernel sizes.

Whether or not you read everything I wrote, my question is:

How can I perform a fast 2D convolution between a relatively large image and a very small kernel (3x3) with CUDA?

Recommended Answer

You are right in that a 3x3 kernel is not suitable for an FFT-based approach. The best way to deal with this would be to push the kernel into constant memory (or, if you are using a Fermi+ card, this should not matter too much).

Since you know the kernel size, the fastest way to do this would be to read chunks of the input image/signal into shared memory and perform an unrolled multiply-and-add operation, as sketched below.
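A minimal sketch of that idea follows (my illustration, not code from the answer or the linked tutorial): the 3x3 kernel sits in constant memory, each 16x16 thread block stages an 18x18 tile (16x16 plus a one-pixel halo) of the image in shared memory, and the 3x3 multiply-and-add is fully unrolled. The names conv3x3 and TILE, and the clamp-to-edge border policy, are assumptions.

#include <cuda_runtime.h>

#define TILE   16
#define RADIUS 1   // 3x3 kernel => one-pixel halo

__constant__ float d_kernel[3][3];   // filled via cudaMemcpyToSymbol on the host

__global__ void conv3x3(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    int x = blockIdx.x * TILE + threadIdx.x;   // output pixel owned by this thread
    int y = blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load the tile plus its halo, clamping reads at the borders.
    for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE) {
            int gx = min(max((int)(blockIdx.x * TILE) + dx - RADIUS, 0), width  - 1);
            int gy = min(max((int)(blockIdx.y * TILE) + dy - RADIUS, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x >= width || y >= height) return;

    // Fully unrolled 3x3 multiply-and-add.
    float sum = 0.0f;
    #pragma unroll
    for (int ky = 0; ky < 3; ++ky)
        #pragma unroll
        for (int kx = 0; kx < 3; ++kx)
            sum += d_kernel[ky][kx] * tile[threadIdx.y + ky][threadIdx.x + kx];

    out[y * width + x] = sum;
}

// Launch with: conv3x3<<<dim3((w + TILE - 1) / TILE, (h + TILE - 1) / TILE),
//                        dim3(TILE, TILE)>>>(d_in, d_out, w, h);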

-

If you are willing to use libraries to perform this operation, ArrayFire and OpenCV have highly optimized convolution routines that can save you a lot of development time.

I am not too familiar with OpenCV, but in ArrayFire you can do something like the following.

array kernel = array(3, 3, h_kernel, afHost); // Transfer the kernel to gpu
array image  = array(w, h, h_image , afHost); // Transfer the image  to gpu
array result = convolve2(image, kernel);       // Performs 2D convolution
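To bring the result back, ArrayFire's array::host() copies device data into a preallocated buffer. A small sketch (h_result is a hypothetical host-side buffer, not part of the original answer):

float* h_result = new float[w * h]; // hypothetical host-side buffer
result.host(h_result);              // copy the convolved image back to the CPU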

EDIT

The added benefit of using ArrayFire is that its batched operations allow you to perform convolutions in parallel. You can read about how convolutions support batch operations over here.

For example, if you had 10 images that you want to convolve using the same kernel, you could do something like the following:

array kernel = array(3, 3, h_kernel, afHost);     // Transfer the kernel to gpu
array images = array(w, h, 10, h_images, afHost); // Transfer the images to gpu
array res    = convolve2(images, kernel); // Perform all operations simultaneously
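Here res is a w x h x 10 array, and slice i along the third dimension holds the convolution of image i with the kernel.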

-

Full disclosure: I work at AccelerEyes and actively work on ArrayFire.

