Problem description
I am trying to implement a rather simple averaging during transformation of an image. I already successfully implemented the transformation, but now I have to process the resulting image by summing up the pixels of all 5x5 pixel rectangles. My idea was to increment a counter for each such 5x5 block whenever a pixel in this block is set. However, these block counters are not incremented nearly often enough. So for debugging I checked how often any pixel of such a block is hit at all:
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;
if ((x < 5) && (y < 5))
{
    resultArray[0]++;
}
The kernel is invoked like this:
dim3 threadsPerBlock(8, 8);
dim3 grid(targetAreaRect_px._uiWidth / threadsPerBlock.x, targetAreaRect_px._uiHeight / threadsPerBlock.y);
CudaTransformAndAverageImage<<<grid, threadsPerBlock>>>(pcPreRasteredImage_dyn, resultArray);
I would expect resultArray[0] to contain 25 after kernel execution, but it only contains 1. Is this due to some optimization by the CUDA compiler?
Answer
This:
if ((x < 5) && (y < 5))
{
    resultArray[0]++;
}
is a read-modify-write hazard, i.e. a memory race. All of the threads which satisfy (x<5)&&(y<5) can potentially attempt simultaneous reads and writes to resultArray[0]. The CUDA execution model does not guarantee anything about the order of simultaneous memory transactions.
You could make this work by using atomic memory transactions, for example:
if ((x < 5) && (y < 5)) {
    atomicAdd(&resultArray[0], 1);
}
This will serialize the memory transactions and make the calculation correct. It will also have a big negative effect on performance.
You might want to investigate having each block calculate a local sum using a reduction-type calculation, and then summing the block-local sums atomically, on the host, or in a second kernel.
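As a sketch of that approach (the kernel name CudaBlockSum, the hitMask input, and the 8x8 block size are assumptions for illustration, not from the original code), each block could reduce its per-thread contributions in shared memory and then issue a single atomicAdd per block instead of one per thread:

```cuda
// Hypothetical sketch: per-block shared-memory reduction, then one
// atomicAdd per block. Assumes 8x8 thread blocks as in the question,
// and a byte mask marking which pixels are set.
__global__ void CudaBlockSum(const unsigned char *hitMask,
                             int width, int height,
                             unsigned int *resultArray)
{
    __shared__ unsigned int localSum[64];   // 8 * 8 threads per block

    int x   = blockIdx.x * blockDim.x + threadIdx.x;
    int y   = blockIdx.y * blockDim.y + threadIdx.y;
    int tid = threadIdx.y * blockDim.x + threadIdx.x;

    // Each thread contributes 1 if its pixel is in range and set
    localSum[tid] = (x < width && y < height && hitMask[y * width + x]) ? 1u : 0u;
    __syncthreads();

    // Tree reduction over the 64 shared-memory entries
    for (int stride = 32; stride > 0; stride >>= 1) {
        if (tid < stride)
            localSum[tid] += localSum[tid + stride];
        __syncthreads();
    }

    // A single atomic per block carries the block-local sum out
    if (tid == 0)
        atomicAdd(&resultArray[0], localSum[0]);
}
```

This reduces contention on resultArray[0] from one atomic per qualifying thread to one atomic per block; the block-local sums could equally be written to a per-block output array and summed on the host or in a second kernel, as mentioned above.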