I'm looking at this CUDA implementation of the DCT: http://www.cse.nd.edu/courses/cse60881/www/source_code/dct8x8/dct8x8_kernel1.cu
The code in question is:
__shared__ float CurBlockLocal1[BLOCK_SIZE2];

__global__ void CUDAkernel1DCT(float *Dst, int ImgWidth, int OffsetXBlocks, int OffsetYBlocks)
{
    // Block index
    const int bx = blockIdx.x + OffsetXBlocks;
    const int by = blockIdx.y + OffsetYBlocks;

    // Thread index (current coefficient)
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;

    // Texture coordinates
    const float tex_x = (float)( (bx << BLOCK_SIZE_LOG2) + tx ) + 0.5f;
    const float tex_y = (float)( (by << BLOCK_SIZE_LOG2) + ty ) + 0.5f;

    // Copy current image pixel to the first block
    CurBlockLocal1[ (ty << BLOCK_SIZE_LOG2) + tx ] = tex2D(TexSrc, tex_x, tex_y);

    // Synchronize threads to make sure the block is copied
    __syncthreads();
Here BLOCK_SIZE is 8, so BLOCK_SIZE_LOG2 is 3.
Why are the texture coordinates defined this way? Why do we need to use texture coordinates at all? And what does "<<" mean in CUDA?
Best answer
To answer your questions in reverse order:
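In standard C or C++, a << b is the bitwise left shift operator. It is equivalent to a * 2^b when a and b are both non-negative integers and the result fits in the type. So the code you are asking about is basically shorthand for integer multiplication by a power of two.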
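To make the equivalence concrete, here is a minimal host-side C++ sketch of the tile index from the kernel, written once with a shift and once with a multiplication. It assumes BLOCK_SIZE is 8 and BLOCK_SIZE_LOG2 is 3, as stated in the question; the function names are hypothetical and do not come from the original source file.

```cpp
#include <cassert>

// Values from the question: 8x8 blocks, so log2(8) = 3.
constexpr int BLOCK_SIZE = 8;
constexpr int BLOCK_SIZE_LOG2 = 3;

// Index into the 8x8 shared tile, written with a shift as in the kernel.
// (Hypothetical helper name, for illustration only.)
constexpr int tile_index_shift(int tx, int ty) {
    return (ty << BLOCK_SIZE_LOG2) + tx;
}

// The same index written with a plain multiplication.
constexpr int tile_index_mul(int tx, int ty) {
    return (ty * BLOCK_SIZE) + tx;
}

// Compares both forms for every thread position in an 8x8 block.
inline void check_forms_agree() {
    for (int ty = 0; ty < BLOCK_SIZE; ++ty) {
        for (int tx = 0; tx < BLOCK_SIZE; ++tx) {
            assert(tile_index_shift(tx, ty) == tile_index_mul(tx, ty));
        }
    }
}
```

For example, thread (tx = 3, ty = 2) maps to element 2 * 8 + 3 = 19 in either form.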
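As discussed in an appendix of the CUDA Programming Guide, textures are indexed with floating-point coordinates that are texel-centered (voxel-centered in the 3D case), which is why the read arguments in the code you posted are offset by 0.5 in each direction.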
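The effect of the 0.5 offset can be sketched on the host without any CUDA API calls. The code below is only an emulation under stated assumptions: unnormalized coordinates, a point-sampled fetch that selects texel floor(x), and a linear-filtered fetch that blends neighbouring texels with weight frac(x - 0.5), which is how I read the Programming Guide's texture fetching appendix. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

constexpr int BLOCK_SIZE = 8;  // from the question

// Coordinate the kernel computes for block index b and thread index t.
inline float texel_center(int b, int t) {
    return static_cast<float>(b * BLOCK_SIZE + t) + 0.5f;
}

// Emulated point sampling (assumption): the selected texel is floor(coord).
inline int texel_from_coord(float coord) {
    return static_cast<int>(std::floor(coord));
}

// Emulated linear filtering (assumption): blend weight of the next texel,
// computed as the fractional part of (coord - 0.5).
inline float linear_weight(float coord) {
    const float shifted = coord - 0.5f;
    return shifted - std::floor(shifted);
}
```

Under these assumptions, block 2 and thread 3 give the coordinate 19.5, which selects texel 19 with a blend weight of 0, so the fetch returns exactly that pixel. An integer coordinate such as 19.0 would sit on the boundary between texels 18 and 19 and give a weight of 0.5 under linear filtering.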
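The code you are asking about looks to have been written for an early generation of CUDA hardware, whose integer arithmetic was much slower than its floating-point arithmetic. Using bit shifts in place of power-of-two multiplication is most likely a performance optimization, and it might not be useful on newer generations of CUDA hardware.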
The code you asked about could probably be written as:

__shared__ float CurBlockLocal1[BLOCK_SIZE2];

__global__ void CUDAkernel1DCT(float *Dst, int ImgWidth, int OffsetXBlocks, int OffsetYBlocks)
{
    // Block index
    const int bx = blockIdx.x + OffsetXBlocks;
    const int by = blockIdx.y + OffsetYBlocks;

    // Thread index (current coefficient)
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;

    // Texture coordinates
    const float tex_x = (float)( (bx * BLOCK_SIZE) + tx ) + 0.5f;
    const float tex_y = (float)( (by * BLOCK_SIZE) + ty ) + 0.5f;

    // Copy current image pixel to the first block
    CurBlockLocal1[ (ty * BLOCK_SIZE) + tx ] = tex2D(TexSrc, tex_x, tex_y);
    ......
}
Regarding c - texture coordinates in CUDA, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/9864411/