
Question

I've written a CUDA4 Bayer demosaicing routine, but it's slower than single-threaded CPU code, running on a 16-core GTS250.

Block size is (16,16) and the image dims are a multiple of 16 - but changing this doesn't improve it.

Am I doing anything obviously stupid?

--------------- calling routine ------------------
uchar4 *d_output;
size_t num_bytes;

cudaGraphicsMapResources(1, &cuda_pbo_resource, 0);
cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes, cuda_pbo_resource);

// do the conversion, leave the result in the PBO for display
kernel_wrapper(imageWidth, imageHeight, blockSize, gridSize, d_output);

cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0);

--------------- cuda ---------------------------------
texture<uchar, 2, cudaReadModeElementType> tex;
cudaArray *d_imageArray = 0;

__global__ void convertGRBG(uchar4 *d_output, uint width, uint height)
{
    uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
    uint i = __umul24(y, width) + x;

    // input is GR/BG, output is BGRA
    if ((x < width) && (y < height)) {

        if (y & 0x01) {
            if (x & 0x01) {
                d_output[i].x = (tex2D(tex, x + 1, y) + tex2D(tex, x - 1, y)) / 2;  // B
                d_output[i].y = (tex2D(tex, x, y));                                 // G in B
                d_output[i].z = (tex2D(tex, x, y + 1) + tex2D(tex, x, y - 1)) / 2;  // R
            } else {
                d_output[i].x = (tex2D(tex, x, y));  // B
                d_output[i].y = (tex2D(tex, x + 1, y) + tex2D(tex, x - 1, y) + tex2D(tex, x, y + 1) + tex2D(tex, x, y - 1)) / 4;  // G
                d_output[i].z = (tex2D(tex, x + 1, y + 1) + tex2D(tex, x + 1, y - 1) + tex2D(tex, x - 1, y + 1) + tex2D(tex, x - 1, y - 1)) / 4;  // R
            }
        } else {
            if (x & 0x01) {
                // odd col = R
                d_output[i].x = (tex2D(tex, x + 1, y + 1) + tex2D(tex, x + 1, y - 1) + tex2D(tex, x - 1, y + 1) + tex2D(tex, x - 1, y - 1)) / 4;  // B
                d_output[i].z = (tex2D(tex, x, y));  // R
                d_output[i].y = (tex2D(tex, x + 1, y) + tex2D(tex, x - 1, y) + tex2D(tex, x, y + 1) + tex2D(tex, x, y - 1)) / 4;  // G
            } else {
                d_output[i].x = (tex2D(tex, x, y + 1) + tex2D(tex, x, y - 1)) / 2;  // B
                d_output[i].y = (tex2D(tex, x, y));                                 // G in R
                d_output[i].z = (tex2D(tex, x + 1, y) + tex2D(tex, x - 1, y)) / 2;  // R
            }
        }
    }
}



void initTexture(int imageWidth, int imageHeight, uchar *imagedata)
{
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cutilSafeCall(cudaMallocArray(&d_imageArray, &channelDesc, imageWidth, imageHeight));
    uint size = imageWidth * imageHeight * sizeof(uchar);
    cutilSafeCall(cudaMemcpyToArray(d_imageArray, 0, 0, imagedata, size, cudaMemcpyHostToDevice));
    cutFree(imagedata);

    // bind the array to the texture reference, using point sampling
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;
    tex.normalized = false;

    cutilSafeCall(cudaBindTextureToArray(tex, d_imageArray));
}


Answer

There aren't any obvious bugs in your code, but there are several obvious performance opportunities:

1) for best performance, you should use the texture to stage into shared memory - see the 'SobelFilter' SDK sample.
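The staging pattern could look roughly like this (an untested sketch, assuming the same file-scope `tex` reference; corner apron texels, which the diagonal taps also need, are omitted for brevity): each block cooperatively copies its tile plus a one-pixel apron into shared memory, then interpolates from the fast shared copy instead of issuing 2-4 texture fetches per pixel.

```cuda
#define BLOCK_W 16
#define BLOCK_H 4

__global__ void convertGRBG_shared(uchar4 *d_output, uint width, uint height)
{
    // tile with a one-pixel apron on every side
    __shared__ uchar tile[BLOCK_H + 2][BLOCK_W + 2];

    int x = blockIdx.x * BLOCK_W + threadIdx.x;
    int y = blockIdx.y * BLOCK_H + threadIdx.y;

    // each thread fetches its own texel; border threads also fetch the
    // apron (clamp addressing handles reads past the image edge)
    tile[threadIdx.y + 1][threadIdx.x + 1] = tex2D(tex, x, y);
    if (threadIdx.x == 0)           tile[threadIdx.y + 1][0]           = tex2D(tex, x - 1, y);
    if (threadIdx.x == BLOCK_W - 1) tile[threadIdx.y + 1][BLOCK_W + 1] = tex2D(tex, x + 1, y);
    if (threadIdx.y == 0)           tile[0][threadIdx.x + 1]           = tex2D(tex, x, y - 1);
    if (threadIdx.y == BLOCK_H - 1) tile[BLOCK_H + 1][threadIdx.x + 1] = tex2D(tex, x, y + 1);
    __syncthreads();

    if (x >= width || y >= height) return;

    // ... demosaic exactly as before, but reading neighbours from
    // tile[threadIdx.y + 1 +/- dy][threadIdx.x + 1 +/- dx] ...
}
```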

2) As written, the code is writing bytes to global memory, which is guaranteed to incur a large performance hit. You can use shared memory to stage results before committing them to global memory.
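Even short of full shared-memory staging, the three separate byte-wide stores per pixel can be collapsed into one 32-bit store by assembling the pixel in a register first - a sketch, assuming the rest of the kernel is unchanged:

```cuda
__global__ void convertGRBG_packed(uchar4 *d_output, uint width, uint height)
{
    uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
    if (x >= width || y >= height) return;

    uchar4 pix = make_uchar4(0, 0, 0, 0xff);
    // ... fill pix.x / pix.y / pix.z with exactly the same branch logic
    //     as the original kernel ...

    // one coalesced 4-byte store instead of three 1-byte stores
    d_output[__umul24(y, width) + x] = pix;
}
```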

3) There is a surprisingly big performance advantage to sizing blocks in a way that matches the hardware's texture cache attributes. On Tesla-class hardware, the optimal block size for kernels using the same addressing scheme as your kernel is 16x4. (64 threads per block)
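Applied to the calling code, that suggestion is just a change to the launch configuration (illustrative, assuming the same kernel):

```cuda
// 16x4 = 64 threads per block, matching the texture cache's access pattern
dim3 blockSize(16, 4);
dim3 gridSize((imageWidth  + blockSize.x - 1) / blockSize.x,
              (imageHeight + blockSize.y - 1) / blockSize.y);
convertGRBG<<<gridSize, blockSize>>>(d_output, imageWidth, imageHeight);
```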

For workloads like this, it may be hard to compete with optimized CPU code. SSE2 can do 16 byte-sized operations in a single instruction, and CPUs are clocked about 5 times as fast.

