When I try to put AMP code between #pragma pack(4) and #pragma pack(), I get an error saying that AMP does not support #pragma pack.

Is there some default #pragma pack that AMP uses (4?), or how does AMP handle structure layout?

The reason I ask is that I made a C++ library which lets me write simple host and kernel code once and execute it on any CPU, AMP or CUDA device. It looks something like this:


#include "gxKernel.h"

struct tResult	{ uEmu64 sum;	};

	uEmu64 m=_Counter;
	m*=_Counter;  // m= i^2
	_Data.sum+=m; // sum(1..N) i^2


#include "gxLauncher.h"

int main(){
	// get problem related params and run on GPUs
	gxClass_NP(cs, tResult);
	uInt n=3000000;
	cs.doWork( n+1 );
	// aggregate and show results
	uInt64 sumPowers=0;
	for (int i=0; i<cs.N; i++)	sumPowers+=cs.Data[i].sum;
	printf("\n\nSum(i=1..%u) of i^2 = %llu\n", n, sumPowers );  // unsigned format specifiers for unsigned values
	// exit
	return 0;
}

Console output when run:

iD|Mode | #thrd |iterations|done|  iter/s | sz/thrd | Name
 0| CPU |     0 |          |    |     *DISABLED*    | CPU(8c) i7 950 @ 3.07GHz
 1|CUDA | 24576 |    2.96M | 99%|   0.00k |   1.00  | GeForce GTX 780
 2| AMP | 32768 |   32.76k |  1%| 574.00k |   1.00  | Radeon HD 6900
 2|     | 57344 |    3.00M |100%| 574.00k |      0s | ETA: 0s
Finished 3000001 iterations in 588 ms.

Sum(i=1..3000000) of i^2 = 9000004500000500000

The console output above shows the correct result (9000004500000500000) and shows that kernels ran on both the CUDA and AMP devices. The relative speeds and per-GPU numbers can be ignored here: this run had only 3 million iterations and finished almost as soon as it began. In a longer run the split between CUDA and AMP would be closer to the relative card speeds, although AMP also has the handicap of slower emulated 64-bit arithmetic.
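The expected value can be cross-checked with the closed form Sum(i=1..N) of i^2 = N(N+1)(2N+1)/6. Below is a minimal sketch of that check; the function names are mine and not part of gxLibrary. The divisions are done early because N(N+1)(2N+1) itself would overflow 64 bits for N = 3000000, while the final result still fits.

```cpp
#include <cstdint>

// N(N+1)/2, dividing whichever factor is even so nothing is lost.
constexpr std::uint64_t triangular(std::uint64_t n) {
    return (n % 2 == 0) ? (n / 2) * (n + 1) : n * ((n + 1) / 2);
}

// N(N+1)(2N+1)/6, written as triangular(N) * (2N+1) / 3.
// Either triangular(N) or (2N+1) is divisible by 3, so divide that factor
// before multiplying to keep intermediate values small.
constexpr std::uint64_t sumOfSquares(std::uint64_t n) {
    return (triangular(n) % 3 == 0)
        ? (triangular(n) / 3) * (2 * n + 1)
        : triangular(n) * ((2 * n + 1) / 3);
}

// Matches the value printed by the run above.
static_assert(sumOfSquares(3000000ULL) == 9000004500000500000ULL,
              "closed form disagrees with the console output");
```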

But the main point is the data structure defined in Kernel.cpp and used in the kernels (struct tResult). It needs to be compiled and used in three kernels (for CPU, AMP and CUDA; even though the code is written only once, my gxLibrary replicates it into other gxLibrary files and compiles them as CUDA or CPU) and on the host side (in Main.cpp).

There is no problem when the user structure contains only 32-bit members. But here I used uEmu64, a class (struct) I made to emulate the 64-bit unsigned integer that is missing on AMP. As an optimization, my gxLibrary aliases uEmu64 to unsigned _Int64 in CPU code (both kernels and host) and CUDA code, while using my _emU64 { uInt32 lo,hi;} structure to emulate it on AMP.
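To make the aliasing idea concrete, here is a minimal sketch of such an emulated type. It is not the actual gxLibrary code: Emu64, add, toEmu, fromEmu and the GX_TARGET_AMP macro are names invented for this illustration, and only addition is shown.

```cpp
#include <cstdint>

// Hypothetical AMP-side stand-in for a 64-bit unsigned integer, built from
// two 32-bit words (low word first, as in a little-endian 64-bit integer).
struct Emu64 {
    std::uint32_t lo, hi;
};

// 64-bit addition using only 32-bit arithmetic: add the low words, then
// carry 1 into the high word if the low word wrapped around.
constexpr Emu64 add(Emu64 x, Emu64 y) {
    return Emu64{ x.lo + y.lo,
                  x.hi + y.hi + ((x.lo + y.lo) < x.lo ? 1u : 0u) };
}

// Host-only helpers, used here just to compare against a native value.
constexpr Emu64 toEmu(std::uint64_t v) {
    return Emu64{ static_cast<std::uint32_t>(v),
                  static_cast<std::uint32_t>(v >> 32) };
}
constexpr std::uint64_t fromEmu(Emu64 e) {
    return (static_cast<std::uint64_t>(e.hi) << 32) | e.lo;
}

// Alias selection. GX_TARGET_AMP is a made-up macro name for this sketch.
#if defined(GX_TARGET_AMP)
typedef Emu64 uEmu64;          // emulated on AMP
#else
typedef std::uint64_t uEmu64;  // native on CPU and CUDA
#endif
```

Note that the two alternatives have the same size (8 bytes) but not necessarily the same alignment requirement, which is what matters once the type is embedded in a larger struct.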

There is also no problem with that approach when uEmu64 is used alone, or when it is aligned on 8 bytes.

BUT the problem appears if the user defines and uses a structure like this:

struct tResult	{
	uInt32 a;
	uEmu64 sum;
	uInt32 b;
};

It took me some time to pinpoint the problem, and this time it is not AMP's fault but the regular VS C++ compiler's "fault" on the CPU side. The compiler decides that having a 64-bit member on a 32-bit boundary is not smart (which is true, but...) and changes the internal layout of the struct. At first it looked to me as if the members were rearranged, for example to {sum,a,b}, but since C++ compilers keep members in declaration order, it is most likely padding inserted before sum to put it on an 8-byte boundary. For some reason the NVCC CUDA compiler always matches that layout (my gxLibrary produces code that is compiled for CUDA too), which leaves only AMP faithfully keeping the original unpadded {a,sum,b}.

The end result is obviously a bug in the program, since the CPU/host side reads values from the wrong positions in the struct when it was filled by AMP code.
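The mismatch can be reproduced on the host alone by comparing the two layouts. This is a sketch under the assumption that the AMP side sees the struct as if uEmu64 were a pair of 32-bit words; ResultNative, ResultEmu and layoutsMatch are illustrative names, not gxLibrary ones.

```cpp
#include <cstddef>
#include <cstdint>

struct Emu64 { std::uint32_t lo, hi; };   // only needs 4-byte alignment

// What the CPU/CUDA side sees: a native 64-bit member.
struct ResultNative { std::uint32_t a; std::uint64_t sum; std::uint32_t b; };

// What the AMP side sees: the emulated member.
struct ResultEmu    { std::uint32_t a; Emu64 sum; std::uint32_t b; };

// The emulated version has no reason to be padded: four 32-bit words.
static_assert(offsetof(ResultEmu, sum) == 4, "sum directly after a");
static_assert(sizeof(ResultEmu) == 16, "no padding expected");

// True only if both views agree on size and on every member offset.
constexpr bool layoutsMatch() {
    return sizeof(ResultNative) == sizeof(ResultEmu)
        && offsetof(ResultNative, sum) == offsetof(ResultEmu, sum)
        && offsetof(ResultNative, b)   == offsetof(ResultEmu, b);
}
```

On a typical 64-bit build I would expect ResultNative to be 24 bytes with sum at offset 8, against 16 bytes and offset 4 for ResultEmu, so layoutsMatch() returns false there.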

BTW, the struct example above is bad programming practice from a performance standpoint (in structures the largest members should go first, 64-bit members should sit on 64-bit boundaries, etc.), but I'm writing gxLibrary so that anyone can use it ... and "anyone" will include programmers who simply want to set up a struct in any order they like and expect it to work ;p

After detecting the problem, I tried to force the CPU compiler to keep the original layout by using volatile, but that did not work. The only solution I found that I can implement in the gxLauncher.h header without changing project properties (necessary if I want to be sure it will work in a new user project) is to enclose the replicated "kernel.cpp" code in:

#pragma pack(4)
#include "kernel.cpp"
#pragma pack()

That fixes the problem for now, but only when I avoid any #pragma around the AMP copy of the kernel. I assume that AMP either uses pack(4) or never pads or rearranges structs, but it would be good to know how AMP handles member alignment inside structures, so that I know whether my #pragma pack(4) workaround might suddenly stop working in the future.
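Until that is answered, one way to at least detect a future breakage is to let the compiler check the layout. The sketch below assumes the AMP view of the struct is equivalent to the emulated version; ResultHost and ResultAmp are illustrative names. It also uses the push/pop form of the pragma, which restores whatever packing was active before instead of resetting to the default.

```cpp
#include <cstddef>
#include <cstdint>

struct Emu64 { std::uint32_t lo, hi; };

// Host/CUDA view, packed so that 64-bit members only need 4-byte alignment.
#pragma pack(push, 4)
struct ResultHost { std::uint32_t a; std::uint64_t sum; std::uint32_t b; };
#pragma pack(pop)   // restore the previous packing

// Assumed AMP view: no pragma, emulated 64-bit member.
struct ResultAmp { std::uint32_t a; Emu64 sum; std::uint32_t b; };

// If a compiler or packing change ever breaks the match, the build fails
// here instead of the host silently reading the wrong bytes.
static_assert(sizeof(ResultHost) == sizeof(ResultAmp), "size mismatch");
static_assert(offsetof(ResultHost, a)   == offsetof(ResultAmp, a),   "offset of a differs");
static_assert(offsetof(ResultHost, sum) == offsetof(ResultAmp, sum), "offset of sum differs");
static_assert(offsetof(ResultHost, b)   == offsetof(ResultAmp, b),   "offset of b differs");
```

This only guards the host-side half of the assumption. Whether the AMP compiler itself lays out the emulated struct without padding is exactly the open question above.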



