问题描述
我有一个类在其构造函数中调用内核,如下所示:
ScalarField.h
解决方案strong>简短版本:
A类主要的是,在调用 class A 的构造函数之前,不会运行使用您的内核初始化CUDA运行时库所需的特定挂接例程。这是因为没有保证在C ++执行模型中实例化和初始化静态对象的顺序。您的全局作用域类将在初始化CUDA设置的全局作用域对象之前进行实例化。您的内核代码在调用之前从不会加载到上下文中,并导致运行时错误。
尽可能地告诉我,这是一个真正的限制CUDA运行时API,而不是在用户代码中轻松修复的东西。在你的例子中,你可以使用 cudaMemset 或者一个基于非符号的运行时API函数来替换内核调用,它会工作。这个问题完全限于通过运行时API在运行时加载的用户内核或设备符号。因此,一个空的默认构造函数也会解决你的问题。从设计的角度来看,我会非常怀疑在构造函数中调用内核的任何模式。添加一个特定的类GPU设置/拆卸的方法不依赖于默认的构造函数或析构函数将是一个更干净,更少的容易出错的设计,IMHO。
strong>详细:
有一个内部生成的例程( __ cudaRegisterFatBinary ),必须运行在无错误地调用内核之前,使用CUDA驱动程序API加载和注册任何运行时API程序的fatbin有效内容中包含的内核,纹理和静态定义的设备符号。这是运行时API的延迟上下文初始化特性的一部分。您可以自己确认如下:
这是您发布的修订示例的gdb跟踪。注意我在 __ cudaRegisterFatBinary 中插入一个断点,并且在调用静态 A 构造函数之前没有到达,发布失败:
talonmies @ box:〜$ gdb a.out
GNU gdb(Ubuntu / Linaro 7.4-2012.04 -0ubuntu2.1)7.4-2012.04
版权所有(C)2012自由软件基金会,
许可GPLv3 +:GNU GPL第3版或更高版本< http://gnu.org/licenses/gpl。 html>
这是免费软件:您可以自由更改和重新分配。
在法律允许的范围内,没有任何保证。有关详细信息,请键入显示复制
和显示保修。
此GDB配置为x86_64-linux-gnu。
有关错误报告说明,请参阅:
< http://bugs.launchpad.net/gdb-linaro/> ...
从/ home / talonmies / a读取符号.out ... done。
(gdb)break'__cudaRegisterFatBinary'
断点1在0x403180
(gdb)run
启动程序:/home/talonmies/a.out
[线程调试使用libthread_db enabled]
使用主机libthread_db库/lib/x86_64-linux-gnu/libthread_db.so.1。
标量字段
[新主题0x7ffff5a63700(LWP 10774)]
A类
内核:设备功能无效
[线程0x7ffff5a63700(LWP 10774)退出]
[使用代码0377退出的下级1(过程10771)]
这里是相同的过程,在 main (确保在执行延迟设置的对象已经初始化之后发生)中的 A / p>
talonmies @ box:〜$ cat main.cu
#includeclassA.h
int main(){
a_object;
std :: cout<< 主< std :: endl;
return 0;
}
talonmies @ box:〜$ nvcc --keep -arch = sm_30 -g main.cu
talonmies @ box:〜$ gdb a.out
GNU gdb(Ubuntu / Linaro 7.4-2012.04-0ubuntu2.1)7.4-2012.04
版权所有(C)2012自由软件基金会,
许可GPLv3 +:GNU GPL第3版或更高版本< http: /gnu.org/licenses/gpl.html>
这是免费软件:您可以自由更改和重新分配。
在法律允许的范围内,没有任何保证。有关详细信息,请键入显示复制
和显示保修。
此GDB配置为x86_64-linux-gnu。
有关错误报告说明,请参阅:
< http://bugs.launchpad.net/gdb-linaro/> ...
从/ home / talonmies / a读取符号.out ... done。
(gdb)break'__cudaRegisterFatBinary'
断点1在0x403180
(gdb)run
启动程序:/home/talonmies/a.out
[线程调试使用libthread_db enabled]
使用主机libthread_db库/lib/x86_64-linux-gnu/libthread_db.so.1。
断点1,0x0000000000403180在__cudaRegisterFatBinary()
(gdb)cont
继续。
标量字段
[新主题0x7ffff5a63700(LWP 11084)]
A类
主
[线程0x7ffff5a63700(LWP 11084)退出]
[ (process 11081)exited normally]
如果这真的是一个致命的问题, NVIDIA开发人员支持并提出错误报告。
I have a class that calls a kernel in its constructor, as follows:
"ScalarField.h"
#include <iostream> void ERROR_CHECK(cudaError_t err,const char * msg) { if(err!=cudaSuccess) { std::cout << msg << " : " << cudaGetErrorString(err) << std::endl; std::exit(-1); } } class ScalarField { public: float* array; int dimension; ScalarField(int dim): dimension(dim) { std::cout << "Scalar Field" << std::endl; ERROR_CHECK(cudaMalloc(&array, dim*sizeof(float)),"cudaMalloc"); } };"classA.h"
#include "ScalarField.h" static __global__ void KernelSetScalarField(ScalarField v) { int index = threadIdx.x + blockIdx.x * blockDim.x; if (index < v.dimension) v.array[index] = 0.0f; } class A { public: ScalarField v; A(): v(ScalarField(3)) { std::cout << "Class A" << std::endl; KernelSetScalarField<<<1, 32>>>(v); ERROR_CHECK(cudaGetLastError(),"Kernel"); } };"main.cu"
#include "classA.h" A a_object; int main() { std::cout << "Main" << std::endl; return 0; }If i instantiate this class on main (A a_object;) i get no errors. However, if I instantiate it outside main, just after defining it (class A {...} a_object;) I get an "invalid device function" error when the kernel launches. Why does that happen?
EDIT
Updated code to provide a more complete example.
EDIT 2
Following the advice in the comment by Raxvan, I wanted to say i have the dimensions variable used in ScalarField constructor also defined (in another class) outside main, but before everything else. Could that be the explanation? The debugger was showing the right value for dimensions though.
解决方案The short version:
The underlying reason for the problem when class A is instantiated outside of main is that a particular hook routine which is required to initialise the CUDA runtime library with your kernels is not being run before the constructor of class A is being called. This happens because there are no guarantees about the order in which static objects are instantiated and initialised in the C++ execution model. Your global scope class is being instantiated before the global scope objects which do the CUDA setup are initialised. Your kernel code is never being loaded into the context before it is call, and a runtime error results.
As best as I can tell, this is a genuine limitation of the CUDA runtime API and not something easily fixed in user code. In your trivial example, you could replace the kernel call with a call to cudaMemset or one of the non-symbol based runtime API memset functions and it will work. This problem is completely limited to user kernels or device symbols loaded at runtime via the runtime API. For this reason, an empty default constructor would also solve your problem. From a design point of view, I would be very dubious of any pattern which calls kernels in the constructor. Adding a specific method for class GPU setup/teardown which doesn't rely on the default constructor or destructor would be a much cleaner and less error prone design, IMHO.
In detail:
There is an internally generated routine (__cudaRegisterFatBinary) which must be run to load and register kernels, textures and statically defined device symbols contained in the fatbin payload of any runtime API program with the CUDA driver API before the kernel can be called without error. This is a part of the "lazy" context initialisation feature of the runtime API. You can confirm this for yourself as follows:
Here is a gdb trace of the revised example you posted. Note I insert a breakpoint into __cudaRegisterFatBinary, and that isn't reached before your static A constructor is called and the kernel launch fails:
talonmies@box:~$ gdb a.out GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04 Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: <http://bugs.launchpad.net/gdb-linaro/>... Reading symbols from /home/talonmies/a.out...done. (gdb) break '__cudaRegisterFatBinary' Breakpoint 1 at 0x403180 (gdb) run Starting program: /home/talonmies/a.out [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Scalar Field [New Thread 0x7ffff5a63700 (LWP 10774)] Class A Kernel : invalid device function [Thread 0x7ffff5a63700 (LWP 10774) exited] [Inferior 1 (process 10771) exited with code 0377]Here is the same procedure, this time with A instantiation inside main (which is guaranteed to happen after the objects which perform lazy setup have been initialised):
talonmies@box:~$ cat main.cu #include "classA.h" int main() { A a_object; std::cout << "Main" << std::endl; return 0; } talonmies@box:~$ nvcc --keep -arch=sm_30 -g main.cu talonmies@box:~$ gdb a.out GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04 Copyright (C) 2012 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: <http://bugs.launchpad.net/gdb-linaro/>... Reading symbols from /home/talonmies/a.out...done. (gdb) break '__cudaRegisterFatBinary' Breakpoint 1 at 0x403180 (gdb) run Starting program: /home/talonmies/a.out [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Breakpoint 1, 0x0000000000403180 in __cudaRegisterFatBinary () (gdb) cont Continuing. Scalar Field [New Thread 0x7ffff5a63700 (LWP 11084)] Class A Main [Thread 0x7ffff5a63700 (LWP 11084) exited] [Inferior 1 (process 11081) exited normally]If this is really a crippling problem for you, I would suggest contacting NVIDIA developer support and raising a bug report.
这篇关于从静态初始化代码启动CUDA内核时遇到问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!