Runtime GPU or CPU execution?

Problem Description

I feel like there has to be a way to write code such that it can run on either the CPU or the GPU. That is, I want to write something that has, for example, a CPU FFT implementation that can be executed when there is no GPU, but defaults to a GPU FFT when a GPU is present. I haven't been able to craft the right question to get the interwebs to offer up a solution.

My application target has GPUs available, and we want to write certain functions to use them. However, our development VMs are a different story. It seems very desirable to be able to run a code/unit-test cycle without having to jump to GPU hardware.

If I need to do some clever run-time checking/library loading, I'm OK with that; I just need a cookbook.

How do people do continuous integration of GPU-enabled code?

The target environment is nVidia/CUDA. I'm new to GPU code, so maybe this is an FAQ (but I haven't found it yet).

Recommended Answer

I think this should be fairly straightforward.

The typical approach would be:

1. Link your code statically against the CUDA Runtime library (cudart). If you compile with nvcc, this is the default behavior.
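
As a quick illustration of this step (these command lines are my own sketch, reusing the t467.cu example from below; -cudart is nvcc's standard option for choosing how the runtime library is linked):

$ nvcc t467.cu -o t467                  # cudart is linked statically (the default)
$ nvcc -cudart shared t467.cu -o t467   # opts into the dynamic cudart instead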

2. (Presumably) near the beginning of your code, choose a CUDA Runtime API call such as cudaGetDevice(). Use some form of proper CUDA error checking (always a good idea, anyway). In this case we will use the error return from this first runtime API call to make our path decision, as opposed to simply terminating the application.
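
A minimal sketch of that decision, assuming a helper wrapper (the name gpuIsAvailable is mine, not part of the answer):

#include <cuda_runtime.h>

// Returns true only if the CUDA runtime can report a current device.
bool gpuIsAvailable() {
  int dev = 0;
  cudaError_t err = cudaGetDevice(&dev);  // first runtime API call
  return err == cudaSuccess;              // any error: take the host path
}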

3. If the runtime API call in step 2 returns cudaSuccess (as the functional return value, not the device index), it is safe to assume that there is at least one functional CUDA GPU. In that case, further inspection of the environment can be done if desired or needed, perhaps following a sequence similar to the CUDA deviceQuery sample code. This status can be stored in your program for future decisions about which code path to follow.
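
For example, a sketch of such further inspection (my own code, loosely patterned after the deviceQuery sample; it uses only standard runtime API calls):

#include <cuda_runtime.h>
#include <cstdio>

// Count the visible CUDA devices and print basic properties of each.
int usableGpus() {
  int n = 0;
  if (cudaGetDeviceCount(&n) != cudaSuccess) return 0;
  for (int i = 0; i < n; i++) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
      printf("device %d: %s, compute capability %d.%d\n",
             i, prop.name, prop.major, prop.minor);
  }
  return n;
}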

4. If the runtime API call in step 2 returns anything other than cudaSuccess, it almost certainly means that CUDA is non-functional, perhaps because there is no CUDA GPU. In that case, I'd advise against any further use of any CUDA API or library; from there on, your code should use host-only code paths.
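
One possible way to store that status once and consult it wherever a code-path decision is needed (my own sketch, not prescribed by the answer; it relies on C++11 static-local initialization):

#include <cuda_runtime.h>

// Probe CUDA exactly once; later callers see the cached result.
static bool gpuPathEnabled() {
  static const bool ok = [] {
    int dev = 0;
    return cudaGetDevice(&dev) == cudaSuccess;
  }();  // initialized on first call, thread-safe since C++11
  return ok;
}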

Here is a fully worked example. It uses the CUFFT library to perform a simple FFT operation if a functional CUDA environment is found; otherwise it uses FFTW to do the same thing in host code. Note that in addition to statically linking against the cudart library (the default with nvcc, so not obvious), I am also statically linking against the CUFFT library. At least on Linux, as in the example here, this prevents failures at application launch time due to an inability to find dynamic libraries to link against (which would prevent the application from running at all, whereas in that case our intent is that the application runs but chooses host code paths).

$ cat t467.cu
#include <cufft.h>
#include <fftw.h>   // FFTW 2.x API (fftw_create_plan/fftw_one, .re/.im fields)
#include <iostream>

int main(){

  double data[] = {0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0};
  int N = sizeof(data)/sizeof(data[0]);
  int dev = 0;
  // The error return of this first runtime API call selects the code path.
  if (cudaGetDevice(&dev) == cudaSuccess) {
    // GPU code path, using CUFFT
    cufftDoubleComplex *din, *dout, *in, *out;
    in  = new cufftDoubleComplex[N];
    out = new cufftDoubleComplex[N];
    for (int i = 0; i < N; i++) {in[i].x = data[i]; in[i].y = 0.0;} // zero imaginary parts
    cudaError_t err = cudaMalloc(&din,  sizeof(din[0]) * N);
    if (err == cudaSuccess) err = cudaMalloc(&dout, sizeof(din[0]) * N);
    if (err != cudaSuccess) {std::cout << "allocation failure" << std::endl; return 1;}
    cufftHandle plan;
    cufftResult cstat = cufftPlan1d(&plan, N, CUFFT_Z2Z, 1);
    if (cstat != CUFFT_SUCCESS) {std::cout << "plan failure" << std::endl; return 1;}
    cudaMemcpy(din, in, N*sizeof(din[0]), cudaMemcpyHostToDevice);
    cstat = cufftExecZ2Z(plan, din, dout, CUFFT_FORWARD);   // forward DFT on the device
    if (cstat != CUFFT_SUCCESS) {std::cout << "exec failure" << std::endl; return 1;}
    cudaMemcpy(out, dout, N*sizeof(din[0]), cudaMemcpyDeviceToHost);
    // keep the squared magnitude of each output bin
    for (int i = 0; i < N; i++) data[i] = out[i].x * out[i].x + out[i].y * out[i].y;
    cudaFree(din); cudaFree(dout);
    delete[] in;  delete[] out;
    cufftDestroy(plan);
    std::cout << "GPU calculation: " << std::endl;
    }
  else {
    // CPU code path, using FFTW (2.x interface)
    fftw_complex *in, *out;
    fftw_plan p;
    in  = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    for (int i = 0; i < N; i++) {in[i].re = data[i]; in[i].im = 0;}
    p = fftw_create_plan(N, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_one(p, in, out);                                   // forward DFT on the host
    fftw_destroy_plan(p);
    // keep the squared magnitude of each output bin
    for (int i = 0; i < N; i++) data[i] = out[i].re * out[i].re + out[i].im * out[i].im;
    fftw_free(in); fftw_free(out);
    std::cout << "CPU calculation: " << std::endl;
    }
  for (int i = 0; i < N; i++)
    std::cout << data[i] << ", ";
  std::cout << std::endl;
  return 0;
}
$ nvcc t467.cu -o t467 -lcufft_static -lculibos -lfftw -lm
$ ./t467
GPU calculation:
0, 0, 16, 0, 0, 0, 16, 0,
$ CUDA_VISIBLE_DEVICES="" ./t467
CPU calculation:
0, 0, 16, 0, 0, 0, 16, 0,
$

Note that the above example still links dynamically against FFTW, so your execution environments (both CPU and GPU) need to have an appropriate fftwX.so library available. The general process of making a Linux executable work across a variety of settings (beyond the CUDA dependencies) is outside the scope of this example and of what I intend to answer. On Linux, ldd is your friend.
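
For instance, to list the shared libraries the binary will try to load at launch (the exact output depends on your build and system, so none is shown here):

$ ldd ./t467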
