Question
I am implementing an image filtering operation in C using multiple threads, and I want it to be as optimized as possible. I have one question, though: if thread 0 accesses a memory location and thread 1 accesses the same location concurrently, will thread 1 get it from the cache? The question stems from the possibility that the two threads could be running on two different cores of the CPU. Put another way: do all the cores share the same cache?
Suppose I have a memory layout like the following:
int output[100];
Assume there are 2 CPU cores, so I spawn two threads to work concurrently. One scheme is to divide the memory into two chunks, 0-49 and 50-99, and let each thread work on one chunk. Another is to let thread 0 work on the even indices (0, 2, 4, and so on) while the other thread works on the odd indices (1, 3, 5, and so on). The latter technique is easier to implement (especially for 3D data), but I am not sure whether it uses the cache efficiently.
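To make the two schemes concrete, here is a minimal sketch of how each one maps a thread number to array indices. The names `index_range`, `block_range`, `fill_block`, and `fill_interleaved` are hypothetical helpers introduced only for illustration, and writing a constant stands in for the real filter.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end) assigned to one thread. */
typedef struct {
    size_t begin;
    size_t end;
} index_range;

/* Block scheme: thread t of nthreads gets one contiguous chunk of n elements. */
index_range block_range(size_t n, size_t nthreads, size_t t)
{
    assert(nthreads > 0 && t < nthreads);
    index_range r;
    r.begin = n * t / nthreads;
    r.end = n * (t + 1) / nthreads;
    return r;
}

/* Block scheme: thread t writes only inside its own contiguous chunk. */
void fill_block(int *output, size_t n, size_t nthreads, size_t t, int value)
{
    index_range r = block_range(n, nthreads, t);
    for (size_t i = r.begin; i < r.end; i++)
        output[i] = value;
}

/* Interleaved scheme: thread t writes t, t + nthreads, t + 2 * nthreads, ... */
void fill_interleaved(int *output, size_t n, size_t nthreads, size_t t, int value)
{
    assert(nthreads > 0 && t < nthreads);
    for (size_t i = t; i < n; i += nthreads)
        output[i] = value;
}
```

With `n = 100` and two threads, `block_range` yields [0, 50) for thread 0 and [50, 100) for thread 1, while `fill_interleaved` has thread 0 touch the even indices and thread 1 the odd ones.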
Recommended answer
In general, it is a bad idea to have threads work on interleaved memory regions, such as one thread processing 0, 2, 4, ... and the other processing 1, 3, 5, .... Some architectures may handle this well, but most will not, and you probably cannot control which machines your code will run on. The OS is also free to schedule your threads on any cores it likes (a single core, two cores on the same physical processor, or two cores on separate processors). In addition, each core usually has its own first-level cache, even when the cores are on the same processor.
In most situations, the 0, 2, 4, ... / 1, 3, 5, ... scheme will degrade performance severely, possibly to the point of being slower than a single CPU. Because neighbouring elements share a cache line, each write by one core invalidates that line in the other core's cache (false sharing). Herb Sutter's article "Eliminate False Sharing" demonstrates this very well.
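As a rough illustration of why the interleaved scheme suffers, the sketch below counts how many cache lines are written by both threads under each scheme. It assumes 64-byte cache lines, 4-byte elements, and an array that starts on a cache-line boundary. None of this is guaranteed on a given machine, and `count_shared_lines` is a hypothetical helper, not a measurement tool.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; real hardware may differ. */
enum { ASSUMED_LINE_BYTES = 64 };

/* Cache line (relative to the array start) that holds element i. */
static size_t line_of(size_t i)
{
    return i * sizeof(int32_t) / ASSUMED_LINE_BYTES;
}

/*
 * Count cache lines written by both of two threads.
 * interleaved != 0: thread 0 owns even indices, thread 1 owns odd indices.
 * interleaved == 0: thread 0 owns [0, n/2), thread 1 owns [n/2, n).
 */
size_t count_shared_lines(size_t n, int interleaved)
{
    size_t shared = 0;
    size_t i = 0;
    while (i < n) {
        size_t line = line_of(i);
        int seen0 = 0, seen1 = 0;
        /* Walk all elements that fall into the current cache line. */
        for (; i < n && line_of(i) == line; i++) {
            int owner = interleaved ? (int)(i % 2) : (i < n / 2 ? 0 : 1);
            if (owner == 0)
                seen0 = 1;
            else
                seen1 = 1;
        }
        if (seen0 && seen1)
            shared++;
    }
    return shared;
}
```

Under these assumptions, a 100-element array spans 7 cache lines. The interleaved scheme shares all 7 between the two threads, whereas the two-chunk scheme shares only the single line that straddles the boundary at index 50.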
Splitting the array into [0 ... n/2-1] and [n/2 ... n-1] will scale much better on most systems. It may even lead to superlinear speedup, because the combined cache capacity of all the cores can be used. The number of threads should always be configurable and should default to the number of processor cores found.
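A minimal sketch of this advice using POSIX threads follows. It makes several assumptions: `sysconf(_SC_NPROCESSORS_ONLN)` is available (it is a common extension on Linux and macOS but not required by POSIX), the doubling loop is only a placeholder for the real filter, and `chunk_job`, `default_thread_count`, and `process_parallel` are hypothetical names. Passing `nthreads = 0` selects the detected core count.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Work description for one thread: a contiguous chunk [begin, end). */
typedef struct {
    int *data;
    size_t begin;
    size_t end;
} chunk_job;

static void *process_chunk(void *arg)
{
    chunk_job *job = arg;
    for (size_t i = job->begin; i < job->end; i++)
        job->data[i] *= 2;            /* placeholder for the real filter */
    return NULL;
}

/* Number of online cores, or 1 if it cannot be determined. */
size_t default_thread_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (size_t)n : 1;
}

/* Process data[0..n) with nthreads contiguous chunks; 0 means "use core count".
 * Returns 0 on success, -1 if allocation or thread creation failed. */
int process_parallel(int *data, size_t n, size_t nthreads)
{
    assert(data != NULL || n == 0);
    if (nthreads == 0)
        nthreads = default_thread_count();
    if (nthreads > n)
        nthreads = n ? n : 1;         /* never more threads than elements */

    pthread_t *threads = malloc(nthreads * sizeof *threads);
    chunk_job *jobs = malloc(nthreads * sizeof *jobs);
    if (!threads || !jobs) {
        free(threads);
        free(jobs);
        return -1;
    }

    size_t started = 0;
    for (size_t t = 0; t < nthreads; t++) {
        jobs[t].data = data;
        jobs[t].begin = n * t / nthreads;
        jobs[t].end = n * (t + 1) / nthreads;
        if (pthread_create(&threads[t], NULL, process_chunk, &jobs[t]) != 0)
            break;
        started++;
    }
    for (size_t t = 0; t < started; t++)
        pthread_join(threads[t], NULL);

    int rc = (started == nthreads) ? 0 : -1;
    free(threads);
    free(jobs);
    return rc;
}
```

For the 100-element `output` array from the question, `process_parallel(output, 100, 2)` gives thread 0 the indices 0-49 and thread 1 the indices 50-99. Depending on the toolchain, the program may need to be built with `-pthread`.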