This article discusses how accesses to the same global memory address from threads in different kernels are handled. Hopefully the answer below is a useful reference for anyone facing the same problem.

Problem Description

I have these questions:

If many threads in a warp want to read the same address in global memory, this data is broadcast, is that right?

If many threads in a warp want to write to the same address in global memory, the writes are serialized, but it is not possible to predict the order, is that right?

But here is the first question: what if many threads in different warps, in different blocks, want to write to the same address in global memory? What will the GPU do? Serialize all accesses to this address? Is there any guarantee of data consistency?

With Hyper-Q it is possible to launch a lot of streams containing kernels. If I have a position in memory, and a number of threads in different kernels want to write to or read this address, what will the GPU do? Serialize the accesses of all threads from different kernels, or will the GPU do nothing and some inconsistencies happen? Is there any guarantee of data consistency when multiple kernels are reading from and writing to the same address?

Recommended Answer


Yes this is true for Fermi (CC2.0) and beyond.


Correct.


If the accesses are simultaneous, they are serialized. Again, order is undefined.
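A minimal CUDA sketch of that point (kernel and variable names are made up for illustration): every lane of one warp stores its own lane ID to the same global address; the hardware serializes the stores, but which one lands last is unspecified, so the surviving value can vary from run to run.

```cuda
#include <cstdio>

// Hypothetical kernel: all 32 threads of one warp write their lane ID
// to the same global address. The stores are serialized by the hardware,
// but the order is undefined, so the surviving value is SOME lane ID.
__global__ void warp_write_race(int *out)
{
    *out = threadIdx.x;  // 32 simultaneous stores to out[0]
}

int main()
{
    int *d_out, h_out;
    cudaMalloc(&d_out, sizeof(int));
    warp_write_race<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    // h_out is one of 0..31; which one is undefined and may change per run
    printf("surviving value: %d\n", h_out);
    cudaFree(d_out);
    return 0;
}
```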


Not sure what you mean by data consistence. Anyway, what else could the GPU do except serialize simultaneous writes? I'm surprised this is such a difficult concept, as there appears to me to be no obvious alternative.


It does not matter what the origin of simultaneous writes to global memory is, whether from the same warp, different warps, different blocks, or different kernels. Simultaneous writes are serialized, in an undefined order. Again, for "data consistence" I'd like to know what you mean by that. Simultaneous reads and writes are also going to produce undefined behavior. The reads may return any value, including the initial value of the memory location or any of the values that were written.

The final result of simultaneous writes to any GPU memory location is undefined. If all simultaneous writes are writing the same value, then the final value in that location will reflect that. Otherwise, the final value will reflect one of the values that got written. Which value is undefined. Beyond that, most of your questions and statements don't make sense to me. (What do you mean by data consistence?) You should not expect anything rational from such programming behavior. The GPU should be programmed as a distributed independent work machine, not a globally synchronous machine. Note that "undefined" also means that results may vary from one run of a kernel to the next, even if the input data is identical.
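When a well-defined result is actually needed from many threads updating one location, atomic operations are the usual tool. A small hedged sketch (names and launch configuration are illustrative, not from the original answer):

```cuda
// Hypothetical kernel: each thread increments a shared counter.
__global__ void count_writers(int *counter)
{
    // atomicAdd performs the read-modify-write as one indivisible step,
    // so the final count equals the number of participating threads,
    // regardless of which warps, blocks, or concurrent kernels they
    // came from. The ORDER of the increments is still undefined.
    atomicAdd(counter, 1);
}
// Host side: launch this from several streams; after synchronizing,
// *counter holds the total number of threads launched.
```

Note that atomics make the final value deterministic, but they do not impose any particular ordering among the threads.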

Simultaneous or nearly simultaneous reading and writing of global memory from different blocks (whether from the same or different kernels) is especially hazardous on Fermi (cc2.x) devices due to the independent non-coherent L1 caches that are interposed between the SMs (where the threadblocks execute) and the L2 cache (which is device-wide, and therefore coherent). Attempting to create synchronized behavior between threadblocks using global memory as a vehicle is difficult at best, and discouraged. It is suggested to consider ways to recast your algorithm to structure the work independently.
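One common way to recast such an algorithm, sketched here under assumed names: give each block its own private output slot and combine the partial results afterwards, so no two blocks ever write the same address and no inter-block coherence assumptions are needed.

```cuda
// Hypothetical sketch: each block reduces its own tile of the input into
// block_sums[blockIdx.x]. Blocks never share a destination address, so
// no inter-block ordering or cache-coherence assumptions are required.
// Assumes blockDim.x == 256 (a power of two) at launch.
__global__ void partial_sums(const float *in, float *block_sums, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Standard tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = s[0];  // private slot per block
}
// A second kernel (or the host) then sums block_sums[] safely.
```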
