This article looks at counting coordinate hits in CUDA. The answer below may be a useful reference for anyone facing the same problem.

Problem Description


I am trying to implement a rather simple averaging during transformation of an image. I already successfully implemented the transformation, but now I have to process this resulting image by summing up the pixels of all 5x5-pixel rectangles. My idea was to increment a counter for each such 5x5 block whenever a pixel in this block is set. However, these block counters are incremented far less often than expected. So for debugging I checked how often any pixel of such a block is hit at all:

    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;

    if ((x < 5) && (y < 5))
    {
        resultArray[0]++;
    }

The kernel is invoked like this:

dim3 threadsPerBlock(8, 8);
dim3 grid(targetAreaRect_px._uiWidth / threadsPerBlock.x, targetAreaRect_px._uiHeight / threadsPerBlock.y);
CudaTransformAndAverageImage<<<grid, threadsPerBlock>>>(pcPreRasteredImage_dyn, resultArray);


I would expect resultArray[0] to contain 25 after kernel execution, but it only contains 1. Is this due to some optimization by the CUDA compiler?

Answer

This:

if ((x < 5) && (y < 5))
{
    resultArray[0]++;
}

contains a memory race: the increment is a non-atomic read-modify-write.


All of the threads which satisfy (x<5)&&(y<5) can potentially attempt simultaneous reads and writes from resultArray[0]. The CUDA execution model does not guarantee anything about the order of simultaneous memory transactions.
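To see why, note that `resultArray[0]++` is not one operation but three. A sketch of what each thread effectively executes (the temporary is hypothetical, introduced only for illustration):

```cuda
// resultArray[0]++ expands to a separate load, add, and store:
int tmp = resultArray[0];   // load  -- many threads can read the same old value
tmp = tmp + 1;              // add   -- each computes old value + 1
resultArray[0] = tmp;       // store -- later stores silently overwrite earlier ones
```

If all 25 qualifying threads load 0 before any of them stores, every thread writes back 1, which matches the observed result.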


You could make this work by using atomic memory transactions, for example:

if((x<5)&&(y<5)) {
    atomicAdd(&resultArray[0], 1);
}


This will serialize the memory transactions and make the calculation correct. It will also have a big negative effect on performance.
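For reference, here is a minimal self-contained reproducer using the atomic version. The kernel body and launch configuration are taken from the question; the host boilerplate, the fixed 2x2 grid, and dropping the image argument are assumptions made to keep the example compilable on its own:

```cuda
#include <cstdio>

__global__ void CudaTransformAndAverageImage(int *resultArray)
{
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;

    if ((x < 5) && (y < 5)) {
        atomicAdd(&resultArray[0], 1);  // serialized read-modify-write
    }
}

int main()
{
    int *resultArray;
    cudaMallocManaged(&resultArray, sizeof(int));
    resultArray[0] = 0;

    dim3 threadsPerBlock(8, 8);
    dim3 grid(2, 2);  // enough 8x8 blocks to cover the 5x5 region
    CudaTransformAndAverageImage<<<grid, threadsPerBlock>>>(resultArray);
    cudaDeviceSynchronize();

    printf("resultArray[0] = %d\n", resultArray[0]);  // 25 with atomicAdd
    cudaFree(resultArray);
    return 0;
}
```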


You might want to investigate having each block calculate a local sum using a reduction type calculation and then sum the block-local sums atomically, on the host, or in a second kernel.
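A sketch of that two-stage approach, assuming the 8x8 block size from the question (the kernel name, the `pixels` layout, and the "set pixel" test are all hypothetical): each block reduces its threads' contributions in shared memory and then issues a single atomicAdd per block instead of one per thread.

```cuda
__global__ void CountSetPixels(const unsigned char *pixels, int width, int *count)
{
    __shared__ int localSum[64];  // one slot per thread of an 8x8 block

    int x   = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y   = (blockIdx.y * blockDim.y) + threadIdx.y;
    int tid = threadIdx.y * blockDim.x + threadIdx.x;

    // Each thread contributes 1 if its pixel is set, 0 otherwise.
    localSum[tid] = (pixels[y * width + x] != 0) ? 1 : 0;
    __syncthreads();

    // Tree reduction over the 64 entries in shared memory.
    for (int stride = 32; stride > 0; stride >>= 1) {
        if (tid < stride)
            localSum[tid] += localSum[tid + stride];
        __syncthreads();
    }

    // One atomic per block instead of one per thread.
    if (tid == 0)
        atomicAdd(count, localSum[0]);
}
```

This keeps almost all of the traffic in fast shared memory and reduces contention on the global counter by a factor of 64.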

