This article covers the question "What is the default #pragma pack for AMP?" and may be a useful reference for anyone who runs into the same problem.

Problem Description

When I try to put AMP code between #pragma pack(4) and #pragma pack(), I get an error warning that AMP does not support pragma pack.

Is there some default #pragma pack that AMP uses (4?), or how does AMP handle structure rearrangement?

The reason I ask is that I made a C++ library which lets me write simple host and kernel code once and execute it on any CPU, AMP or CUDA device. It looks something like this:

Kernel.CPP

#include "gxKernel.h"

struct tResult	{ uEmu64 sum;	};

GXBEGIN_NP(tResult)
	uEmu64 m=_Counter;
	m*=_Counter;  // m= i^2
	_Data.sum+=m; // sum(1..N) i^2
GXEND

Main.CPP

#include "gxLauncher.h"

int main(){
	// get problem related params and run on GPUs
	gxClass_NP(cs, tResult);
	uInt n=3000000;
	cs.doWork( n+1 );
	// aggregate and show results
	uInt64 sumPowers=0;
	for (int i=0; i<cs.N; i++)	sumPowers+=cs.Data[i].sum;
	printf("\n\nSum(i=1..%d) of i^2 = %lld\n", n, sumPowers );
	// exit
	_getch();
    return 0;
}

Console output when run:

iD|Mode | #thrd |iterations|done|  iter/s | sz/thrd | Name
------------------------------------------------------------------------------
 0| CPU |     0 |          |    |     *DISABLED*    | CPU(8c) i7 950 @ 3.07GHz
 1|CUDA | 24576 |    2.96M | 99%|   0.00k |   1.00  | GeForce GTX 780
 2| AMP | 32768 |   32.76k |  1%| 574.00k |   1.00  | Radeon HD 6900
==============================================================================
 2|     | 57344 |    3.00M |100%| 574.00k |      0s | ETA: 0s
------------------------------------------------------------------------------
Finished 3000001 iterations in 588 ms.
------------------------------------------------------------------------------


Sum(i=1..3000000) of i^2 = 9000004500000500000

The console output above shows the correct result (9000004500000500000), and it shows that the kernels ran on both CUDA and AMP. (The relative speeds/numbers per GPU can be ignored here, since this run had only 3 million iterations and finished almost as soon as it began; in a longer run the distribution between CUDA and AMP would be closer to the relative card speeds, although here AMP also has the handicap of slower emulated 64-bit.)

But the main point is the data structure defined in Kernel.cpp and used in the kernels (struct tResult), which needs to be compiled and used in three kernels (for CPU, AMP and CUDA ... even though only one version of the code is written, my gxLibrary replicates that code into other gxLibrary files and compiles them as CUDA or CPU) and on the host side (in Main.cpp).

There is no problem with that when the user structure contains only 32-bit members. But here I used uEmu64, which is a class (struct) I made to emulate the 64-bit unsigned integer that is missing on AMP. My gxLibrary does some optimization by aliasing uEmu64 to unsigned __int64 in CPU code (both kernels and host) and in CUDA code, while using my _emU64 { uInt32 lo, hi; } structure for the emulation on AMP.
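A minimal sketch of what that aliasing could look like (GX_TARGET_AMP and the exact member set are my assumptions for illustration, not the actual gxLibrary code):

// Illustrative only - not the real gxLibrary definitions.
typedef unsigned int uInt32;

#if defined(GX_TARGET_AMP)              // assumed build-configuration macro
// AMP has no native 64-bit integers, so emulate with two 32-bit halves.
struct uEmu64 {
    uInt32 lo, hi;
    // restrict(amp) operators (+=, *=, ...) would be implemented here
};
#else
// CPU and CUDA builds can alias the native 64-bit type directly.
typedef unsigned __int64 uEmu64;
#endif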

There is also no problem with that approach when uEmu64 is used alone, or when it happens to fall on an 8-byte boundary.

BUT the problem happens if the user defines and uses a structure like this:

struct tResult	{
	uInt32 a;
	uEmu64 sum;
	uInt32 b;
};

It took me some time to pinpoint the problem, and this time it is not AMP's fault but a regular VS C++ compiler "fault" on the CPU side. The compiler decides that having a 64-bit member at a 32-bit boundary is not smart (which is true, but...) and then rearranges the internal positions of the members, for example making the layout {sum, a, b}. For some reason, the NVCC CUDA compiler always matches the same rearrangement (my gxLibrary also produces code that is compiled for CUDA), and that leaves only AMP faithfully keeping the original {a, sum, b}.
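One way to see what the host compiler actually did with this layout is to print the member offsets. A small self-contained check (uInt32/uEmu64 are stand-ins mapped to the native host types, as described above for the CPU build):

// Standalone host-side layout check, illustrative only.
#include <cstdio>
#include <cstddef>

typedef unsigned int       uInt32;
typedef unsigned long long uEmu64;   // on the host, uEmu64 aliases a native 64-bit integer

struct tResult { uInt32 a; uEmu64 sum; uInt32 b; };

int main() {
    // With default MSVC packing, sum lands at offset 8; under #pragma pack(4) it lands at offset 4.
    printf("sizeof(tResult)        = %u\n", (unsigned)sizeof(tResult));
    printf("offsetof(tResult, sum) = %u\n", (unsigned)offsetof(tResult, sum));
    printf("offsetof(tResult, b)   = %u\n", (unsigned)offsetof(tResult, b));
    return 0;
}

The AMP build, where uEmu64 is the two-word struct with 4-byte alignment, would presumably place sum at offset 4, which is exactly the mismatch described here.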

It is obvious that the end result is a bug in the program, since the CPU/host side will read values from the wrong positions in that struct if it was filled by AMP code.

BTW, the struct example above is bad programming practice from a performance standpoint (in structures the largest members should go first, 64-bit members on 64-bit boundaries, etc.), but I'm writing gxLibrary so anyone can use it ... and "anyone" will include programmers who simply want to set up a struct in whatever order they want and expect it to work ;p
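For comparison, the layout-friendly ordering that guideline describes would look like this (tResultOrdered is just an illustrative name):

// Same fields, largest member first: the 64-bit value sits at offset 0, so the
// layout is identical under default packing, #pragma pack(4), or the AMP
// two-word emulation - 16 bytes, no compiler-inserted padding.
struct tResultOrdered {
	uEmu64 sum;   // offset 0
	uInt32 a;     // offset 8
	uInt32 b;     // offset 12
};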

After detecting the problem, I tried to force the CPU side not to rearrange the members by using volatile, but that did not work. The only solution I found that I can implement in the gxLauncher.h header without changing project properties (which is needed if I want to be sure it will work in a new user project) is to enclose the replicated "kernel.cpp" code in:

#pragma pack(4)
#include "kernel.cpp"
#pragma pack()
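As an extra safety net for this workaround (my own suggestion, not something gxLibrary does), the expected packed layout could be pinned with compile-time checks next to the pack(4)-wrapped include, so a future compiler or packing change would break the build instead of silently reading wrong offsets again. A self-contained sketch with stand-in typedefs and the three-member tResult from the example above:

#include <cstddef>   // offsetof

typedef unsigned int       uInt32;
typedef unsigned long long uEmu64;

#pragma pack(4)
struct tResult { uInt32 a; uEmu64 sum; uInt32 b; };   // stand-in for the struct that kernel.cpp would define
#pragma pack()

// Fails the build if the packed host layout ever drifts from the 4-byte-aligned
// layout the AMP side produces.
static_assert(offsetof(tResult, sum) == sizeof(uInt32),
              "host layout of tResult no longer matches the AMP layout");
static_assert(sizeof(tResult) == 4 * sizeof(uInt32),
              "unexpected padding in tResult");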


And that pack(4) wrapper currently fixed the problem - but only because I avoided putting any #pragma around the AMP copy of the kernel. Now, I assume that AMP either uses pack(4) or never rearranges structs, but it would be good to know how AMP handles member alignment inside structures, in order to know whether my workaround with #pragma pack(4) will suddenly stop working in the future or not.



