Question
I was trying to find the best work-group size for a problem and ran into something I couldn't explain.
These are my results:
- GlobalWorkSize {6400 6400 1}, WorkGroupSize {64 4 1}, Time (ms) = 44.18
- GlobalWorkSize {6400 6400 1}, WorkGroupSize {4 64 1}, Time (ms) = 24.39
Swapping the axes made execution almost twice as fast. Why?
By the way, I was using an AMD GPU.
Thanks :-)
EDIT: This is the kernel (a simple matrix transposition):
__kernel void transpose(__global float *input, __global float *output, const int size) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    output[i*size + j] = input[j*size + i];
}
Recommended answer
I agree with @Thomas: it most probably depends on your kernel. Most likely, in the second case you access memory in a coalesced way and/or make full use of each memory transaction.
Coalescing: when threads need to access elements in memory, the hardware tries to serve those accesses with as few transactions as possible. For example, if thread 0 and thread 1 access contiguous elements, a single transaction can serve both.
Full use of a memory transaction: say you have a GPU that fetches 32 bytes in one transaction. If you have 4 threads that each fetch one int, you use only half of the data the transaction fetched (16 of 32 bytes, assuming an int is 4 bytes); the rest is wasted.
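The arithmetic above can be checked with a few lines of Python. This is only a sketch of the toy model used in this answer: the 32-byte transaction size and the 4-byte int are the assumptions stated in the text, the function name is made up for illustration, and the threads are assumed to read contiguous elements starting at an aligned address.

```python
TRANSACTION_BYTES = 32  # transaction size assumed in the text
INT_BYTES = 4           # assumed size of an int


def transaction_utilization(num_threads, bytes_per_thread=INT_BYTES,
                            transaction_bytes=TRANSACTION_BYTES):
    """Fraction of the fetched bytes that are actually used.

    Assumes the threads read contiguous elements from an aligned address.
    """
    used = num_threads * bytes_per_thread
    transactions = -(-used // transaction_bytes)  # ceiling division
    return used / (transactions * transaction_bytes)


print(transaction_utilization(4))  # 0.5: 16 of the 32 fetched bytes are used
print(transaction_utilization(8))  # 1.0: the transaction is fully used
```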
To illustrate this, say you have an n by n matrix to access. The matrix is stored in row-major order, and you use n threads organized in one dimension. You have two possibilities:
- Each work-item handles one column, looping over that column's elements one at a time.
- Each work-item handles one row, looping over that row's elements one at a time.
It might be counter-intuitive, but the first solution can make coalesced accesses while the second cannot. The reason is that when the first work-item accesses the first element of the first column, the second work-item accesses the first element of the second column, and so on. These elements are contiguous in memory. This is not the case for the second solution, where simultaneous accesses are a full row (n elements) apart.
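To make the difference concrete, here is a small Python model that counts how many 32-byte segments the work-items touch at a single loop step under each of the two solutions. It is a sketch under the assumptions of this answer (row-major storage, 4-byte elements, 32-byte transactions, matrix starting at an aligned address); the function and its parameters are hypothetical, and real hardware coalesces per wavefront rather than across all n work-items.

```python
def transactions_per_step(n, layout, step=0, elem_bytes=4, transaction_bytes=32):
    """Count distinct 32-byte segments touched by n work-items at one loop step.

    layout="column": work-item t walks column t, so at step k it reads (k, t).
    layout="row":    work-item t walks row t, so at step k it reads (t, k).
    """
    segments = set()
    for t in range(n):
        row, col = (step, t) if layout == "column" else (t, step)
        byte_address = (row * n + col) * elem_bytes  # row-major addressing
        segments.add(byte_address // transaction_bytes)
    return len(segments)


print(transactions_per_step(64, "column"))  # 8: neighbours share segments
print(transactions_per_step(64, "row"))     # 64: one segment per work-item
```

With one column per work-item, 64 work-items need 8 transactions per step; with one row per work-item, they need 64, each carrying a single useful element.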
Now take the same example and apply solution 1, but this time with 4 work-items instead of n, on the same GPU as above. You will most probably increase the time by a factor of 2, since half of each memory transaction is wasted.
EDIT: Now that you have posted your kernel, I see that I forgot to mention something else.
With your kernel, choosing a local size of (1, 256) or (256, 1) looks like a bad choice either way. In the first case, 256 transactions are needed to read a column of the input (each fetching 32 bytes of which only 4 are used, assuming the same GPU as in my previous examples), while 32 transactions are needed to write the output: one transaction can write 8 floats, so writing the 256 elements takes 32 transactions.
A work-group size of (256, 1) has the same problem, but this time with 32 transactions to read and 256 to write.
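These counts can be reproduced by replaying the kernel's index arithmetic in Python. The sketch below looks only at the work-group at the origin and counts the distinct 32-byte segments it reads and writes; the function name, the aligned buffers, and the idea of counting over the whole work-group at once (rather than per wavefront) are simplifying assumptions, not a description of how a particular AMD GPU schedules memory requests.

```python
def transpose_transactions(local_size, size=6400, elem_bytes=4, transaction_bytes=32):
    """Return (read, write) transaction counts for the work-group at the origin.

    Mirrors the kernel: output[i*size + j] = input[j*size + i].
    """
    local_x, local_y = local_size
    read_segments, write_segments = set(), set()
    for i in range(local_x):        # get_global_id(0)
        for j in range(local_y):    # get_global_id(1)
            read_segments.add((j * size + i) * elem_bytes // transaction_bytes)
            write_segments.add((i * size + j) * elem_bytes // transaction_bytes)
    return len(read_segments), len(write_segments)


print(transpose_transactions((1, 256)))  # (256, 32)
print(transpose_transactions((256, 1)))  # (32, 256)
```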
So why does the first size work better? Because there is a cache system that can mitigate the bad access pattern on the read side. The size (1, 256) is therefore good for the writes, and the cache handles the poorly laid-out reads, reducing the number of read transactions actually needed.
Note that the number of transactions decreases overall (taking into account all the work-groups in the NDRange). For example, the first work-group issues 256 transactions to read the first 256 elements of the first column. The second work-group might just hit the cache to retrieve the elements of the second column, because they were already fetched by the 32-byte transactions issued by the first work-group.
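The reuse described here can be illustrated by comparing the 32-byte segments that neighbouring columns fall into. This is a sketch only: it assumes the 6400-wide float matrix from the question, aligned buffers, and a cache that still holds the segments when the next work-group runs, which depends on cache size and scheduling order on real hardware. The helper name is made up for illustration.

```python
def read_segments(column, rows=256, size=6400, elem_bytes=4, transaction_bytes=32):
    """32-byte segments read by a (1, 256) work-group handling one input column."""
    return {(j * size + column) * elem_bytes // transaction_bytes
            for j in range(rows)}


first = read_segments(0)
for column in range(1, 9):
    shared = len(first & read_segments(column))
    print(f"column {column}: {shared} of 256 segments already fetched for column 0")
```

In this model, columns 1 to 7 fall entirely inside the segments already fetched for column 0, and column 8 is the first one that needs fresh transactions.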
Now, I'm almost sure that you can do better than (1, 256): try (8, 32).
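A quick way to see why (8, 32) looks promising is to count transactions for the three sizes in the same toy model (32-byte transactions, 4-byte floats, aligned buffers, the work-group at the origin, no cache). The function below is a hypothetical helper, and the numbers describe the model only; actual timings should be measured on the target GPU.

```python
def count_transactions(local_size, size=6400, elem_bytes=4, transaction_bytes=32):
    """Return (reads, writes) in 32-byte transactions for one work-group.

    Mirrors the kernel: output[i*size + j] = input[j*size + i].
    """
    reads, writes = set(), set()
    for i in range(local_size[0]):      # get_global_id(0)
        for j in range(local_size[1]):  # get_global_id(1)
            reads.add((j * size + i) * elem_bytes // transaction_bytes)
            writes.add((i * size + j) * elem_bytes // transaction_bytes)
    return len(reads), len(writes)


for local_size in [(1, 256), (256, 1), (8, 32)]:
    reads, writes = count_transactions(local_size)
    print(local_size, "reads:", reads, "writes:", writes, "total:", reads + writes)
```

In this model, (8, 32) needs 32 read and 32 write transactions for its 256 elements, 64 in total, against 288 for either (1, 256) or (256, 1), because every fetched or written segment is fully used.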