Problem description
CUDA's implementation of the Mersenne Twister (MT) random number generator is limited to a maximum of 256 threads per block and 200 blocks per grid, i.e. the maximum number of threads is 51200.
Therefore, it is not possible to launch a kernel that uses the MT with
    kernel<<<blocksPerGrid, threadsPerBlock>>>(devMTGPStates, ...)
where
    int blocksPerGrid = (n+threadsPerBlock-1)/threadsPerBlock;
and n is the total number of threads.
What is the best way to use the MT with more than 51200 threads?
My approach is to use constant values for blocksPerGrid and threadsPerBlock, e.g. <<<128,128>>>, and use the following in the kernel code:
__global__ void kernel(curandStateMtgp32 *state, int n, ...) {
    int id = threadIdx.x+blockIdx.x*blockDim.x;
    while (id < n) {
        float x = curand_normal(&state[blockIdx.x]);
        /* some more calls to curand_normal() followed
           by the algorithm that works with the data */
        id += blockDim.x*gridDim.x;
    }
}
I am not sure whether this is the correct way, or whether it can affect the MT state in an undesired way.
Thanks.
Recommended answer
I suggest you read the CURAND documentation carefully and thoroughly.
The MT API will be most efficient when using 256 threads per block with up to 64 blocks to generate numbers.
If you need more than that, you have a variety of options:
- Simply generate more numbers from the existing state set (i.e. 64 blocks, 256 threads), and distribute these numbers among the threads that need them.
- Use more than a single state per block (but this does not allow you to exceed the overall limit within a state set; it only addresses the needs of a single block).
- Create multiple MT generators with independent seeds (and therefore independent state sets).
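As a rough illustration of the first option, here is a minimal sketch that keeps a fixed 64 x 256 launch and lets each thread loop over as many elements as needed. The kernel name fillNormal, the output buffer, the seed 1234 and the problem size are assumptions made up for this example; the setup calls follow the MTGP32 example in the CURAND documentation. As far as I recall, the MTGP32 device functions synchronize the block internally and advance the state by one number per thread on each call, so the sketch gives every thread the same number of loop iterations and guards only the store. Treat that detail as an assumption to check against the documentation.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>        // MTGP32 host helper functions
#include <curand_mtgp32dc_p_11213.h>   // pre-generated parameter sets

// Hypothetical kernel: fills out[0..n) with normal variates.
// Each block uses its own state, state[blockIdx.x].
__global__ void fillNormal(curandStateMtgp32 *state, float *out, int n)
{
    int stride = blockDim.x * gridDim.x;
    int rounds = (n + stride - 1) / stride;   // same value in every thread
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    for (int r = 0; r < rounds; ++r, id += stride) {
        // Called by every thread of the block in every round.
        float x = curand_normal(&state[blockIdx.x]);
        if (id < n) out[id] = x;              // only the store is guarded
    }
}

int main()
{
    const int blocks = 64, threads = 256;     // within the 200 / 256 limits
    const int n = 1000000;                    // many more elements than threads

    curandStateMtgp32 *devStates = nullptr;
    mtgp32_kernel_params *devParams = nullptr;
    float *devOut = nullptr;
    cudaMalloc((void **)&devStates, blocks * sizeof(curandStateMtgp32));
    cudaMalloc((void **)&devParams, sizeof(mtgp32_kernel_params));
    cudaMalloc((void **)&devOut, n * sizeof(float));

    // Load the pre-generated parameters and seed one state per block.
    if (curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devParams)
            != CURAND_STATUS_SUCCESS ||
        curandMakeMTGP32KernelState(devStates, mtgp32dc_params_fast_11213,
                                    devParams, blocks, 1234)
            != CURAND_STATUS_SUCCESS) {
        fprintf(stderr, "MTGP32 setup failed\n");
        return 1;
    }

    fillNormal<<<blocks, threads>>>(devStates, devOut, n);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "kernel failed\n");
        return 1;
    }

    // Copy back and print the sample mean as a quick sanity check.
    std::vector<float> host(n);
    cudaMemcpy(host.data(), devOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    double sum = 0.0;
    for (float v : host) sum += v;
    printf("sample mean: %f\n", sum / n);

    cudaFree(devOut);
    cudaFree(devParams);
    cudaFree(devStates);
    return 0;
}
```

The launch never exceeds 64 blocks of 256 threads, yet n can be arbitrarily large. For the third option, one could presumably repeat the curandMakeMTGP32KernelState call with a different seed for a second state array.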
Generally, I don't see a problem with the kernel that you've outlined, and it's roughly in line with choice 1 above. However, it does not allow you to exceed 51200 threads. (Your example uses <<<128, 128>>>, so 16384 threads.)
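The arithmetic behind these numbers can be written down as a small host-side check. This is only a sketch: the helper names fitsMtgp32 and blocksFor and the problem size 100000 are made up for illustration, and the limits are the ones quoted in the question.

```cuda
// Limits quoted in the question for CUDA's MT generator.
constexpr int kMaxThreadsPerBlock = 256;
constexpr int kMaxBlocks = 200;

// Hypothetical helper: true if a launch configuration stays within the limits.
constexpr bool fitsMtgp32(int blocks, int threads)
{
    return blocks >= 1 && blocks <= kMaxBlocks &&
           threads >= 1 && threads <= kMaxThreadsPerBlock;
}

// Blocks that a one-thread-per-element launch would need for n elements.
constexpr int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// The overall ceiling: 200 blocks * 256 threads = 51200 threads.
static_assert(kMaxBlocks * kMaxThreadsPerBlock == 51200, "thread ceiling");

// <<<128, 128>>> gives 16384 threads and is within the limits.
static_assert(128 * 128 == 16384, "thread count of the example launch");
static_assert(fitsMtgp32(128, 128), "example launch fits");

// One thread per element for n = 100000 would need 391 blocks, which does not fit.
static_assert(blocksFor(100000, 256) == 391, "blocks needed");
static_assert(!fitsMtgp32(blocksFor(100000, 256), 256), "too many blocks");
```

The static_assert lines are evaluated at compile time, so the file compiling at all confirms the numbers.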