Problem description
I am writing a CUDA kernel that requires me to allocate an array of aligned structs on the device. My computations produce the correct results, and I need to write the values to this array starting from index 0.
When I write to this array and copy the results back to the host, some of the entries show up as zero.
Clearly, I am not incrementing the index the way I need to. I tried using a counter that I increment with atomicAdd(), but I still get some values as zero.
To be precise, I may use 1000 threads in my kernel for the computations, but the allocated output array can have fewer than 100 or more than 10000 elements.
My question is: how do I make each of these threads write its value to exactly one location in the array (as the values are calculated) and increment the array index/counter by 1, without overwriting another thread's entry?
Any help will be appreciated. Thanks in advance.
Recommended answer
You can use atomicAdd(). It returns the old value, so you use that value as the index:
int old_i = atomicAdd(&i, 1);  // returns the value of i before the increment
out_array[old_i] = val;        // so each thread gets a distinct index
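The two lines above leave out the setup around them, so here is a fuller sketch of the same pattern. Everything named here (the `Result` struct, `append_results`, `run_append`, the "only even threads produce output" condition) is a placeholder invented for illustration, not code from the question. It assumes the counter lives in device memory and is zeroed before the launch, and it adds a capacity check because the question says the output array may be smaller than the number of threads. Error checking on the CUDA calls is omitted for brevity.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical aligned result type, standing in for the question's struct.
struct __align__(16) Result {
    int   source;  // index of the thread that produced the value
    float value;
};

// Each thread that has a result reserves one slot with atomicAdd and writes there.
__global__ void append_results(Result* out, unsigned int* count,
                               unsigned int capacity, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Placeholder condition: only even-numbered threads produce output.
    if (tid % 2 != 0) return;

    unsigned int slot = atomicAdd(count, 1u);  // value of *count before the increment
    if (slot < capacity) {                     // skip the write if the array is full
        out[slot].source = tid;
        out[slot].value  = 0.5f * tid;
    }
}

// Host helper: launches n threads against an output array of `capacity`
// elements and returns the results that were actually stored.
std::vector<Result> run_append(int n, unsigned int capacity)
{
    Result* d_out = nullptr;
    unsigned int* d_count = nullptr;
    cudaMalloc(&d_out, capacity * sizeof(Result));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));  // the counter must start at 0

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    append_results<<<blocks, threads>>>(d_out, d_count, capacity, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    unsigned int produced = 0;
    cudaMemcpy(&produced, d_count, sizeof(produced), cudaMemcpyDeviceToHost);
    unsigned int stored = std::min(produced, capacity);

    // Copy back only the filled prefix, not the whole allocation.
    std::vector<Result> h_out(stored);
    cudaMemcpy(h_out.data(), d_out, stored * sizeof(Result), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaFree(d_count);
    return h_out;
}
```

Calling `run_append(1000, 10000)` from `main` would return the 500 entries produced by the even-numbered threads, in whatever order the threads happened to reserve their slots. Two things in this sketch are common sources of unexpected zeros: a counter that was never set to 0 before the launch, and copying back more elements than the counter says were written.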
However, you will get poor performance if many of your threads write out results, because atomicAdd() will (indirectly) serialize all the writes. In that case, you should let each thread write its result, if any, to a slot set aside for that thread and then use a compaction algorithm (see thrust::copy_if) to gather up the results.
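A minimal sketch of that per-thread-slot approach follows. It uses plain `float` results instead of a struct to keep it short, and it assumes a negative sentinel can mark "no result"; the kernel, the `HasResult` predicate, and `run_compaction` are hypothetical names made up for this example. It also assumes `n` is greater than zero, since a launch with zero blocks is invalid.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Every thread writes to its own slot, so no atomics are needed.
// A value of -1.0f marks "this thread produced no result" (an assumption
// of this sketch; real results here are always non-negative).
__global__ void compute_per_thread(float* slots, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Placeholder computation: only even-numbered threads produce output.
    slots[tid] = (tid % 2 == 0) ? 0.5f * tid : -1.0f;
}

// Predicate for thrust::copy_if: keep slots that hold a real result.
struct HasResult {
    __host__ __device__ bool operator()(float v) const { return v >= 0.0f; }
};

// Host helper: fills n slots, compacts the valid ones, returns them on the host.
std::vector<float> run_compaction(int n)
{
    thrust::device_vector<float> slots(n);
    thrust::device_vector<float> packed(n);  // worst case: every thread has a result

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    compute_per_thread<<<blocks, threads>>>(thrust::raw_pointer_cast(slots.data()), n);

    // copy_if returns an iterator just past the last element it wrote.
    auto packed_end = thrust::copy_if(slots.begin(), slots.end(),
                                      packed.begin(), HasResult());
    packed.resize(packed_end - packed.begin());

    std::vector<float> h_out(packed.size());
    thrust::copy(packed.begin(), packed.end(), h_out.begin());
    return h_out;
}
```

Unlike the atomicAdd() version, `thrust::copy_if` is stable, so the gathered results keep the order of the threads that produced them. The cost is a scratch array with one slot per thread plus an extra pass over it.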