Reducing the rows of a matrix in OpenCL

Question

I have a matrix stored as a 1D array on the GPU, and I'm trying to write an OpenCL kernel that performs a reduction over each row of this matrix. For example:

Suppose my matrix is 2x3 with the elements [1, 2, 3, 4, 5, 6]. What I want to compute is:

[1, 2, 3] = [ 6]
[4, 5, 6]   [15]

Of course, since this is a reduction, the kernel may return more than one element per row:

[1, 2, 3] = [3, 3]
[4, 5, 6]   [9, 6]

I can then do the final calculation in another kernel or on the CPU.
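As a rough illustration of that final CPU step, here is a minimal C sketch. It assumes the partial results for all rows were read back into one row-major array with the same number of partial sums per row; the helper name `finish_row_sums` and its parameters are hypothetical and not part of the original kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: adds up the partial sums of each row on the CPU.
 * partials holds rows * partials_per_row floats in row-major order. */
void finish_row_sums(const float *partials, size_t rows,
                     size_t partials_per_row, float *row_sums)
{
    assert(partials != NULL && row_sums != NULL);
    for (size_t r = 0; r < rows; ++r) {
        float total = 0.0f;
        for (size_t p = 0; p < partials_per_row; ++p)
            total += partials[r * partials_per_row + p];
        row_sums[r] = total;
    }
}
```

With the partial results from the example above, [3, 3, 9, 6], two rows and two partial sums per row, this gives 6 for the first row and 15 for the second.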

So far I have a kernel that performs the reduction, but over all the elements of the array, like this:

[1, 2, 3] = [21]
[4, 5, 6]

This is the reduction kernel I'm using (which I actually got from elsewhere on Stack Overflow):

__kernel void
sum2(__global float *inVector, __global float *outVector,
     const unsigned int inVectorSize, __local float *resultScratch)
{
  const unsigned int localId = get_local_id(0);
  const unsigned int workGroupSize = get_local_size(0);

  if (get_global_id(0) < inVectorSize)
    resultScratch[localId] = inVector[get_global_id(0)];
  else
    resultScratch[localId] = 0;

  for (unsigned int a = workGroupSize >> 1; a > 0; a >>= 1)
  {
    barrier(CLK_LOCAL_MEM_FENCE);
    if (a > localId)
      resultScratch[localId] += resultScratch[localId + a];
  }

  if (localId == 0)
    outVector[get_group_id(0)] = resultScratch[0];
  barrier(CLK_LOCAL_MEM_FENCE);
}

Recommended answer

One solution is to modify your reduction kernel so that it can reduce just a part of the array.

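__kernel void
sum2(__global float *inVector,
     __global float *outVector,
     unsigned int   inVectorOffset,  /* index of the first element to reduce */
     unsigned int   inVectorSize,    /* number of elements to reduce */
     __local float  *resultScratch)  /* one float per work-item in the group */
{
  const unsigned int localId = get_local_id(0);
  const unsigned int workGroupSize = get_local_size(0);

  /* Load one element per work-item; pad with 0 past the end of the range. */
  if (get_global_id(0) < inVectorSize)
    resultScratch[localId] = inVector[inVectorOffset + get_global_id(0)];
  else
    resultScratch[localId] = 0;

  /* Tree reduction in local memory (assumes a power-of-two work-group size). */
  for (unsigned int a = workGroupSize >> 1; a > 0; a >>= 1)
  {
    barrier(CLK_LOCAL_MEM_FENCE);
    if (a > localId)
      resultScratch[localId] += resultScratch[localId + a];
  }

  /* Work-item 0 writes this work-group's partial sum. */
  if (localId == 0)
    outVector[get_group_id(0)] = resultScratch[0];
  barrier(CLK_LOCAL_MEM_FENCE);
}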

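You can then reduce a single row of the matrix by passing the index of the row's first element as inVectorOffset and the number of elements in the row as inVectorSize. Note that every launch writes its partial sums to outVector starting at index 0, so each row needs its own output buffer, or the results must be read back before the next row is processed.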
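To make the indexing concrete, here is a plain C sketch that emulates the modified kernel on the CPU, one work-group at a time, and then runs it once per row. It is only an approximation for checking results, not OpenCL host code: the function names `sum2_group` and `reduce_rows`, the `MAX_LOCAL_SIZE` bound, and the row-major layout with the offset computed as `row * cols` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 256  /* assumed upper bound for this sketch only */

/* Emulates what one work-group of sum2 computes and returns the value
 * that work-item 0 would store in outVector[group_id]. */
float sum2_group(const float *in, size_t offset, size_t size,
                 size_t group_id, size_t local_size)
{
    float scratch[MAX_LOCAL_SIZE];
    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);
    /* The halving loop only covers every element for power-of-two sizes. */
    assert((local_size & (local_size - 1)) == 0);

    /* Load phase: one element per work-item, zero past the end of the row. */
    for (size_t lid = 0; lid < local_size; ++lid) {
        size_t gid = group_id * local_size + lid;
        scratch[lid] = (gid < size) ? in[offset + gid] : 0.0f;
    }
    /* Tree reduction: each step reads only from the upper half [a, 2a). */
    for (size_t a = local_size >> 1; a > 0; a >>= 1)
        for (size_t lid = 0; lid < a; ++lid)
            scratch[lid] += scratch[lid + a];
    return scratch[0];
}

/* One emulated launch per row, followed by the final sum on the CPU. */
void reduce_rows(const float *matrix, size_t rows, size_t cols,
                 size_t local_size, float *row_sums)
{
    size_t groups = (cols + local_size - 1) / local_size;  /* round up */
    for (size_t r = 0; r < rows; ++r) {
        float total = 0.0f;
        for (size_t g = 0; g < groups; ++g)
            total += sum2_group(matrix, r * cols, cols, g, local_size);
        row_sums[r] = total;
    }
}
```

For the 2x3 matrix from the question and a work-group size of 2, the emulated groups produce the partial sums 3 and 3 for the first row and 9 and 6 for the second, and `reduce_rows` returns 6 and 15.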

08-19 23:37