Problem Description
I'm working on a project in CUDA. At first I used only a single block with dimensions 8*8, the same size as my matrix, and I calculated the index as follows:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
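For reference, that single-block launch would look roughly like this on the host side (a sketch; the kernel arguments are elided because only the launch geometry matters here):

dim3 block(8, 8);   // one 8x8 block, matching the matrix
dim3 grid(1, 1);
cover_fault<<<grid, block>>>(/* ... */);
// With a single block, blockIdx.x and blockIdx.y are always 0,
// so idx == threadIdx.x and idy == threadIdx.y, each covering 0..7.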
And it gave me a correct answer. After that I wanted to distribute the threads between blocks to measure the performance, so I made the grid dim (2,1) and the block dim (4,8).
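The corresponding two-block launch would then be (again a sketch, with the arguments elided):

dim3 block(4, 8);   // 4 threads in X, 8 in Y
dim3 grid(2, 1);    // two blocks along X
cover_fault<<<grid, block>>>(/* ... */);
// idx = blockIdx.x * 4 + threadIdx.x still covers columns 0..7,
// idy = threadIdx.y still covers rows 0..7,
// so the index formula itself does not need to change.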
When I trace the code by hand, the formula above still seems to give the correct indices. But when I run the program, the screen hangs and the results are all zero.
What did I do wrong, and how can I fix this?
Here is the kernel function:
__global__ void cover_fault(int *a, int *b, int *c, int *d, int *mulFV1, int *mulFV2, int *checkDalU1, int *checkDalU2, int N)
{
    //Fig.2
    __shared__ int f[9][9];
    __shared__ int compV1[9], compV2[9];
    int dalU1[9], dalU2[9];
    int Ra = 2, Ca = 2;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            f[i][j] = 0;

    f[3][0] = 1;
    f[0][2] = 1;
    f[0][6] = 1;
    f[3][7] = 1;
    f[2][4] = 1;
    f[6][4] = 1;
    f[7][1] = 1;

    int t = 0, A = 1, B = 1, UTP = 5, LTP = -5, U_max = 40, U_min = -160;
    bool flag = true;
    int sumV1, sumV2;
    int checkZero1, checkZero2;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    while (flag == true)
    {
        if (c[idy] == 0)
            compV1[idy] = 1;
        else if (c[idy] == 1)
            compV1[idy] = 0;

        if (d[idy] == 0)
            compV2[idy] = 1;
        else if (d[idy] == 1)
            compV2[idy] = 0;

        sumV1 = reduce(c, N);
        sumV2 = reduce(d, N);

        if (idx < N && idy < N)
        {
            if (idx == 0)
                mulFV1[idy] = 0;
            if (idy == 0)
                mulFV2[idx] = 0;
            __syncthreads();
            atomicAdd(&(mulFV1[idy]), f[idy][idx] * compV2[idx]);
            atomicAdd(&(mulFV2[idx]), f[idy][idx] * compV1[idy]);
        }

        dalU1[idy] = (-1 * A * (sumV1 - Ra)) + (B * mulFV1[idy] * compV1[idy]);
        dalU2[idy] = (-1 * A * (sumV2 - Ca)) + (B * mulFV2[idy] * compV2[idy]);
        a[idy] = a[idy] + dalU1[idy];
        b[idy] = b[idy] + dalU2[idy];

        if (a[idy] > U_max)
            a[idy] = U_max;
        else if (a[idy] < U_min)
            a[idy] = U_min;

        if (b[idy] > U_max)
            b[idy] = U_max;
        else if (b[idy] < U_min)
            b[idy] = U_min;

        if (dalU1[idy] == 0)
            checkDalU1[idy] = 0;
        else
            checkDalU1[idy] = 1;

        if (dalU2[idy] == 0)
            checkDalU2[idy] = 0;
        else
            checkDalU2[idy] = 1;

        __syncthreads();
        checkZero1 = reduce(checkDalU1, N);
        checkZero2 = reduce(checkDalU2, N);

        if (checkZero1 == 0 && checkZero2 == 0)
        {
            flag = false;
        }
        else
        {
            if (a[idy] > UTP)
                c[idy] = 1;
            else if (a[idy] < LTP)
                c[idy] = 0;

            if (b[idy] > UTP)
                d[idy] = 1;
            else if (b[idy] < LTP)
                d[idy] = 0;

            t++;
        } //end else

        sumV1 = 0;
        sumV2 = 0;
        mulFV1[idy] = 0;
        mulFV2[idy] = 0;
    } //end while
} //end function
Recommended Answer
In your index computation, idx will give you the column index and idy the row index. Are you accessing your matrix as M[idy][idx]?
CUDA threads are organized in an orthogonal system: X is horizontal and Y is vertical. So the point you would write as M[0][1] using (x, y) order is M[1][0] in the actual matrix.
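For illustration, a minimal sketch of that convention, assuming a row-major matrix stored in a flat array (the kernel name and the doubling operation are placeholders, not part of the question's code):

__global__ void touch_matrix(int *M, int width, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // X direction -> column
    int idy = blockIdx.y * blockDim.y + threadIdx.y;  // Y direction -> row
    if (idx < width && idy < height)
        M[idy * width + idx] *= 2;  // element at row idy, column idx
}

// Launched with dim3 block(4, 8); dim3 grid(2, 1); every element of an
// 8x8 matrix is visited exactly once.

With a statically sized 2D array such as the f[9][9] in the question, the same convention means indexing it as f[idy][idx].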