从主机访问CUDA全局设备变量 | 从主机访问CUDA全局设备变量

本文介绍了从主机访问CUDA全局设备变量的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我想知道是否有官方消息来源，为什么以下方法起作用：

I was wondering if there is an official source, why the following works:

#include <iostream>

struct Array{
    int el[10000];
};

__device__ Array devAr;

void test(Array& ar = devAr){
    for(int i=0; i<10000; i++)
        ar.el[i] = i;
    std::cout << ar.el[0] + ar.el[9999] << std::endl;
}

int main(){
    test();
}

您将收到警告 __device__变量 devAr无法直接读取主机功能，如果您尝试直接访问devAr，但通过引用则没有此类警告（有充分的理由）。但是在两种情况下，都可以从主机访问变量。似乎有一个该变量的宿主实例。

You get a warning "a __device__ variable "devAr" cannot be directly read in a host function" if you try to access devAr directly but through the reference there is no such warning (for good reason). But in both cases it is possible to access the variable from the host. So it seems, there is a host instance of that variable.

我需要知道的事情：我可以认为这是理所当然的吗？

What I need to know: Can I take this for granted?

其他显示指针值的测试用例：

Other testcase showing the values of the pointers:

#include <iostream>
#include <cstdio>

__device__ int devAr[2];

__global__ void foo(){
    printf("Device: %p\n", &devAr);
    devAr[0] = 1337;
}

int main()
{
    devAr[0] = 4;
    std::cout << devAr[0] << std::endl;
    void* ad;
    cudaGetSymbolAddress(&ad, devAr);
    std::cout << ad << " " << &devAr << std::endl;
    foo<<<1,1>>>();
    cudaDeviceSynchronize();
    int arHost[2];
    cudaMemcpyFromSymbol(arHost, devAr, sizeof(arHost), 0);
    std::cout << "values: " << arHost[0] << std::endl;
}

输出：

推荐答案

您做的是无效的，应该听警告：

What you are doing is invalid and you should listen to the warning:

首先，我将代码简化为仅显示问题所需的大小：

First let me simplify your code a bit to only size necessary to show the issue:

#include <iostream>

__device__ int devAr[1];

int main()
{
    devAr[0] = 4;
    std::cout << devAr[0] << std::endl;
}

现在发生了什么事：

__ device__ int devAr [1]; 在设备内存中分配固定大小的数组，并将指针存储到该设备内存放在 devAr 变量中（出现警告）。

devAr 地址指向有效的设备存储器，但是，即使主机代码中也可以使用该地址，因为主机和设备存储器使用的地址格式相同。但是，在主机代码 devAr 中指向一些随机的未初始化的主机内存。

基于上文可以说 devAr [0] = 4; 只是将 4 写入主机内存中的某个随机未初始化位置。

__device__ int devAr[1]; allocates fixed size array in device memory and stores the pointer to this device memory inside the devAr variable (hence the warning).
The devAr address points to valid piece of device memory, however, such address can be used even in host code, because host and device memory use the addresses in the same format. However, in host code devAr points to some random uninitialized piece of host memory.
Based on above one can say that devAr[0] = 4; just writes 4 into some random uninitialized location in host memory.

尝试运行以下代码，也许它可以帮助您了解实际情况：

Try running the following code, perhaps it will help you understand what is happening under the hood:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>

using namespace std;

__device__ int devAr[1];

__global__ void foo()
{
    printf("dev: %d \n", devAr[0]);
    devAr[0] = 5;
    printf("dev: %d \n", devAr[0]);
}

int main()
{
    cout << "host: " << devAr[0] << endl;
    devAr[0] = 4;
    cout << "host: " << devAr[0] << endl;

    foo << <1, 1 >> >();
    cudaDeviceSynchronize();
    cout << "host: " << devAr[0] << endl;
}

输出将是：

host: 0
host: 4
dev: 0
dev: 5
host: 4

更新：

澄清后在下面的评论中您要问的问题我开始研究这个问题并发现了几个相关的SO线程，大多数引用来自答案下面的评论，它们是：

After clarifying what you are asking in the below comments I started digging in the issue and found couple of related SO threads, most of the quotations come from the comments below the answers, here they are:

：

cudaMemcpyFromSymbol on a __device__ variable:

基于这些证据，让我尝试从上面更新我原来的3步解释：

Based on this evidence let me try to update my original 3step explanation from above:

__ device__ int devAr [1]; 在设备内存中分配固定大小的数组，并将挂钩放入运行时设备符号查找存储到主机版本的 devAr 变量（请参阅链接的资源1和3）。

devAr 地址只是一个垃圾。从主机的角度来看，应仅与符号API调用一起使用，例如 cudaGetSymbolAddress （所有链接的资源似乎都支持该理论），因为它映射到设备 devAr 变量的版本。

__device__ int devAr[1]; allocates fixed size array in device memory and stores "hooks into a runtime device symbol lookup" into the host version of devAr variable (see linked resources 1 and 3).
The devAr address is just a garbage from host's point of view and should only be used with the symbol API calls, such as cudaGetSymbolAddress (all of the linked resources appear to support this theory) because it maps to the device version of devAr variable.

我无法提出任何建议更具体的，例如指向CUDA文档的链接，但我希望现在已经足够清楚了。总而言之，您现在似乎已经可以保证上述行为（例如，主机和设备版本为 devAr 变量），但对我而言，它似乎是除了符号API调用之外，您不应依赖且不应使用 devAr 变量的主机版本的实现细节。

I was not able to come up with anything "more concrete" such as link to CUDA documentation but I hope this is now clear enough. All in all it seems like you now have a guarantee for the behavior described above (i.e. there is a host and device version of devAr variable) but to me it rather appears as an implementation detail which you should not rely on and should not use host version of devAr variable for purposes other than symbol API calls.