Problem description
Is multi-threaded memory access faster than single-threaded memory access?
Assume we are working in C. A simple example is as follows. If I have a gigantic array A and I want to copy A to an array B of the same size, is a multithreaded memory copy faster than a single-threaded one? How many threads are suitable for this kind of memory operation?
Let me narrow the question. First of all, we do not consider the GPU case. Memory access optimization is very important and effective in GPU programming; in my experience, we always need to be careful about memory operations there. On the other hand, that is not always the case when we work on a CPU. In addition, let's not consider SIMD instructions such as AVX and SSE. Those also run into memory performance issues when the program performs many memory accesses relative to its computation. Assume that we work on an x86 architecture with 1-2 CPUs. Each CPU has multiple cores and a quad-channel memory interface. The main memory is DDR4, as is common today.
My array is an array of double-precision floating-point numbers whose size is similar to that of a CPU's L3 cache, roughly 50 MB. Now, I have two cases: 1) copy this array to another array of the same size, either element by element or with memcpy; 2) combine a lot of small arrays into this gigantic array. Both are real-time operations, meaning that they need to be done as fast as possible. Does multithreading give a speedup or a slowdown? What factors affect the performance of memory operations in this case?
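To make case 1 concrete, here is a minimal sketch of a chunked multithreaded copy using POSIX threads. The names copy_task, copy_worker, parallel_copy and MAX_COPY_THREADS are hypothetical and chosen only for illustration; the sketch assumes a POSIX system and says nothing about whether the threaded version is actually faster, which has to be measured on the target machine.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not part of the original question. */
#define MAX_COPY_THREADS 64

typedef struct {
    double *dst;
    const double *src;
    size_t count;               /* number of doubles in this chunk */
} copy_task;

/* Each worker copies one contiguous chunk with memcpy. */
static void *copy_worker(void *arg)
{
    copy_task *t = arg;
    memcpy(t->dst, t->src, t->count * sizeof(double));
    return NULL;
}

/* Copy n doubles from src to dst using nthreads threads.
 * Returns 0 on success, -1 if nthreads is out of range or a thread
 * could not be created (in which case the copy may be incomplete). */
int parallel_copy(double *dst, const double *src, size_t n, int nthreads)
{
    if (nthreads < 1 || nthreads > MAX_COPY_THREADS)
        return -1;

    pthread_t tid[MAX_COPY_THREADS];
    copy_task task[MAX_COPY_THREADS];
    size_t chunk = n / (size_t)nthreads;
    int started = 0;
    int rc = 0;

    for (int i = 0; i < nthreads; i++) {
        size_t begin = (size_t)i * chunk;
        /* The last thread also takes the remainder. */
        size_t len = (i == nthreads - 1) ? n - begin : chunk;

        task[i].dst = dst + begin;
        task[i].src = src + begin;
        task[i].count = len;

        if (pthread_create(&tid[i], NULL, copy_worker, &task[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }

    /* Wait for every thread that was actually started. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    return rc;
}
```

A call such as parallel_copy(B, A, n, 4) splits the array into four contiguous chunks. Creating threads on every call has a cost of its own, so a real-time path would more likely keep a pool of threads alive; that is left out here to keep the sketch short.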
Someone said it will mostly depend on DMA performance. I think that applies when we use memcpy. What if we do an element-wise copy: does the data pass through the CPU cache first?
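For reference, the two copy variants being compared might look like the sketch below. The function names are hypothetical. In the loop version each element is read and written with ordinary load and store instructions, which are served through the CPU cache hierarchy. An optimizing compiler may also recognize the loop and replace it with a call to memcpy, so the two variants can end up as the same machine code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration. restrict tells the compiler that
 * dst and src do not overlap, which is also what memcpy requires. */

/* Element-wise copy: one load and one store per element. */
void copy_elementwise(double *restrict dst, const double *restrict src,
                      size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* The same copy expressed as a single memcpy call. */
void copy_with_memcpy(double *restrict dst, const double *restrict src,
                      size_t n)
{
    memcpy(dst, src, n * sizeof *src);
}
```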
Recommended answer
It depends on many factors. One factor is the hardware you use. On modern PC hardware, multithreading will most likely not improve performance, because CPU time is not the limiting factor for copy operations. The limiting factor is the memory interface. The CPU will most likely use the DMA controller to do the copying, so the CPU will not be very busy while the data is copied.
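Since the outcome depends on the hardware, one way to check it on a given machine is to measure the achieved copy rate and see whether it changes when the copy is split across threads. Below is a minimal single-threaded timing sketch, assuming a POSIX system that provides clock_gettime. The names now_seconds and copy_rate_gbs are hypothetical, and the result counts only bytes copied, not the combined read and write traffic.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Copy n doubles from src to dst `repeats` times and return the best
 * observed rate in GB/s (1e9 bytes copied per second).
 * Returns -1.0 on invalid arguments. The caller should write to both
 * buffers beforehand so that page faults are not part of the timing. */
double copy_rate_gbs(double *dst, const double *src, size_t n, int repeats)
{
    if (n == 0 || repeats < 1)
        return -1.0;

    size_t bytes = n * sizeof *src;
    double best = 0.0;

    for (int r = 0; r < repeats; r++) {
        double t0 = now_seconds();
        memcpy(dst, src, bytes);
        double dt = now_seconds() - t0;

        if (dt > 0.0) {
            double rate = (double)bytes / dt / 1e9;
            if (rate > best)
                best = rate;
        }
    }
    return best;
}
```

Replacing the memcpy call with a threaded copy and running the measurement for 1, 2, 4, ... threads shows where the rate stops improving on that particular system. For arrays close to the L3 cache size, as in the question, the numbers can differ noticeably between the first and later repetitions, so it is worth looking at more than the best value.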