本文介绍了CUDA全球障碍 - 开普勒而不是费米的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

以下全局屏障适用于Kepler K10而不是Fermi GTX580:

The following global barrier works on Kepler K10 and not Fermi GTX580:

__global__ void cudaKernel (float* ref1, float* ref2, int* lock, int time, int dim) {
  int gid  = blockIdx.x * blockDim.x + threadIdx.x;
  int lid  = threadIdx.x;                          
  int numT = blockDim.x * gridDim.x;               
  int numP = int (dim / numT);                     
  int numB = gridDim.x;

  for (int t = 0; t < time; ++t) {
    // compute @ time t
    for (int i = 0; i < numP; ++i) {
      int idx  = gid + i * numT;
      if (idx > 0 && idx < dim - 1)
        ref2 [idx]  = 0.333f * ((ref1 [idx - 1] + ref1 [idx]) + ref1 [idx + 1]);
    }

    // global sync
    if (lid == 0){
      atomicSub (lock, 1);
      while (atomicCAS(lock, 0, 0) != 0);
    }
    __syncthreads();

    // copy-back @ time t
    for (int i = 0; i < numP; ++i) {
      int idx  = gid + i * numT;
      if (idx > 0 && idx < dim - 1)
        ref1 [idx]  = ref2 [idx];
    }

    // global sync
    if (lid == 0){
      atomicAdd (lock, 1);
      while (atomicCAS(lock, numB, numB) != numB);
    }
    __syncthreads();
  }
}

因此,通过查看发送回CPU的输出,我注意到一个线程(第一个或最后一个线程)逃脱屏障,并恢复执行比其他人。我使用CUDA 5.0。块数也总是小于SM的数量(在我的运行集合)。

So, by looking at the output sent back to CPU, I noticed that one thread (either 1st or last thread) escapes the barrier and resumes execution earlier than the others. I'm using CUDA 5.0. number of blocks is also always smaller than number of SMs (in my set of runs).

任何想法为什么相同的代码不能在两个架构上工作?

Any idea why the same code wouldn't work on two architectures? What's new in Kepler that helps this global synchronization?

推荐答案

所以我怀疑屏障代码本身可能工作方式相同。

So I suspect the barrier code itself is probably working the same way. It's what's happening on other data structures not associated with the barrier functionality itself that is at issue, it seems.

Niether Kepler和Fermi都有相互一致的L1缓存,这与其他数据结构没有关联。 。你发现的(虽然它不与你的屏障代码本身相关联)是,L1缓存行为是不同的和。

Niether Kepler nor Fermi have L1 caches that are coherent with each other. What you have discovered (although it's not associated with your barrier code itself) is that the L1 cache behavior is different between Kepler and Fermi.

特别是,Kepler L1缓存不是在上述链接中描述的全局负载上进行的,因此缓存行为在L2级处理是设备范围的,因此是连贯的。当一个Kepler SMX读取它的全局数据,它从L2获得连贯的值。

In particular, Kepler L1 cache is not in play on global loads as described in the above link, and so the caching behavior is handled at L2 level which is device-wide, and therefore coherent. When a Kepler SMX reads it's global data, it's getting coherent values from L2.

另一方面,Fermi有L1缓存也参与全局加载 - 尽管这种行为可以被关闭),并且如上面的链路中所描述的L1高速缓存对于每个费米SM是唯一的,并且与其他SM中的L1高速缓存不一致。当Fermi SM读取它的全局数据时,它从L1获得值,这可能与其他SM中的其他L1缓存不一致。

On the other hand, Fermi has L1 caches that also participate in global loads (by default -- although this behavior can be turned off) and the L1 caches as described in the link above are unique to each Fermi SM and are non-coherent with the L1 caches in other SMs. When a Fermi SM reads it's global data, it's getting values from the L1, which may be non-coherent with other L1 caches in other SMs.

这是区别

正如我所提到的,我相信屏障代码本身可能工作是一样的方式在两个设备上。

As I mentioned, I believe the barrier code itself is probably working the same way on both devices.

这篇关于CUDA全球障碍 - 开普勒而不是费米的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-18 12:11