Problem description
I was just running some multithreaded code on a 4-core machine in the hope that it would be faster than on a single-core machine. Here's the idea: I have a fixed number of threads (in my case, one thread per core). Every thread executes a `Runnable` of the form:
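```java
private static int[] data; // data shared across all threads
// numberOfThreads and barrier (a CyclicBarrier) are assumed to be fields of the enclosing class

public void run() {
    int i = 0;
    while (i++ < 5000) {
        // do some work
        for (int j = 0; j < 10000 / numberOfThreads; j++) {
            // each thread performs calculations and reads from and
            // writes to a different part of the data array
        }
        // wait for the other threads
        try {
            barrier.await();
        } catch (InterruptedException | BrokenBarrierException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            return;
        }
    }
}
```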
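A self-contained sketch of this setup, useful for timing 1 thread against 4, might look like the following. It is an illustrative reconstruction, not the asker's actual code: the class `BarrierDemo`, the helper `runRounds`, and the `data[j]++` stand-in for "some work" are all hypothetical. Each thread owns one contiguous slice of the shared array and waits at a `CyclicBarrier` after every round.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {

    // Runs `threads` workers over a shared array for `rounds` rounds.
    // Each worker owns one contiguous slice and increments every element in it
    // once per round (a hypothetical stand-in for "some work"); all workers then
    // meet at a CyclicBarrier before starting the next round.
    public static int[] runRounds(int threads, int size, int rounds) {
        int[] data = new int[size];                       // shared across all threads
        CyclicBarrier barrier = new CyclicBarrier(threads);
        int chunk = (size + threads - 1) / threads;       // ceiling division
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int from = Math.min(size, t * chunk);
            final int to = Math.min(size, from + chunk);
            workers[t] = new Thread(() -> {
                try {
                    for (int r = 0; r < rounds; r++) {
                        for (int j = from; j < to; j++) {
                            data[j]++;
                        }
                        barrier.await();                  // wait for the other threads
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();                                 // join() makes the workers' writes visible
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining workers", e);
            }
        }
        return data;
    }

    public static void main(String[] args) {
        for (int threads : new int[] {1, 4}) {
            long start = System.nanoTime();
            int[] data = runRounds(threads, 10_000, 5_000);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): " + ms + " ms, data[0] = " + data[0]);
        }
    }
}
```

Because of JIT warm-up, a single timed run like this is only a rough indication; repeating the measurement, or using a harness such as JMH, gives more trustworthy numbers.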
On a quad-core machine, this code performs worse with 4 threads than it does with 1 thread. Even with the `CyclicBarrier`'s overhead, I would have expected the code to run at least 2 times faster. Why does it run slower?
Edit: Here's a busy-wait implementation I tried. Unfortunately, it makes the program run even slower on more cores (this is also being discussed in a separate question):
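```java
public void run() {
    // do work
    synchronized (this) {
        // the last thread to arrive resets the counter and interrupts every thread
        if (atomicInt.decrementAndGet() == 0) {
            atomicInt.set(numberOfOperations);
            for (int i = 0; i < threads.length; i++)
                threads[i].interrupt();
        }
    }
    // spin until this thread is interrupted; Thread.interrupted() also clears the flag
    while (!Thread.interrupted()) {}
}
```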
Recommended answer
Adding more threads is not necessarily guaranteed to improve performance. There are a number of possible causes for decreased performance with additional threads:
- Coarse-grained locking may overly serialize execution - that is, a lock may result in only one thread running at a time. You get all the overhead of multiple threads but none of the benefits. Try to reduce how long locks are held.
- The same applies to overly frequent barriers and other synchronization structures. If the inner `j` loop completes quickly, you might spend most of your time in the barrier. Try to do more work between synchronization points.
- If your code runs too quickly, there may be no time to migrate threads to other CPU cores. This usually isn't a problem unless you create a lot of very short-lived threads. Using a thread pool, or simply giving each thread more work, can help. If your threads run for more than a second or so each, this is unlikely to be a problem.
- If your threads are working on a lot of shared read/write data, cache line bouncing may decrease performance. That said, although this often degrades performance, it alone is unlikely to make performance worse than in the single-threaded case. Try to make sure the data that each thread writes is separated from other threads' data by at least the size of a cache line (usually around 64 bytes). In particular, don't lay out output arrays like `[thread A, B, C, D, A, B, C, D, ...]`.
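To make the cache-line point concrete, here is a small sketch (the class `CacheLineLayout` and its methods are hypothetical) that counts how many cache lines would be written by more than one thread under the interleaved layout versus a blocked layout in which each thread owns one contiguous chunk. It assumes 64-byte lines and 4-byte `int`s, and it ignores the fact that the JVM does not guarantee a line-aligned array start, so treat the result as an idealized model rather than a measurement.

```java
public class CacheLineLayout {

    // Assumption: 64-byte cache lines and 4-byte ints, i.e. 16 ints per line.
    // The JVM does not guarantee that an array's first element is line-aligned,
    // so this counts lines in a simplified, idealized model of the array.
    public static final int INTS_PER_LINE = 64 / Integer.BYTES;

    // Which thread writes element i?
    //   interleaved: [A, B, C, D, A, B, C, D, ...]  -> i % threads
    //   blocked:     [A, A, ..., B, B, ..., C, ...] -> i / chunk
    public static int owner(int i, int size, int threads, boolean interleaved) {
        if (interleaved) {
            return i % threads;
        }
        int chunk = (size + threads - 1) / threads;
        return i / chunk;
    }

    // Counts cache lines written by more than one thread,
    // i.e. lines that are candidates for cache line bouncing (false sharing).
    public static int sharedLines(int size, int threads, boolean interleaved) {
        int shared = 0;
        for (int lineStart = 0; lineStart < size; lineStart += INTS_PER_LINE) {
            int lineEnd = Math.min(size, lineStart + INTS_PER_LINE);
            int first = owner(lineStart, size, threads, interleaved);
            for (int i = lineStart + 1; i < lineEnd; i++) {
                if (owner(i, size, threads, interleaved) != first) {
                    shared++;
                    break;
                }
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        System.out.println("interleaved: " + sharedLines(10_000, 4, true));  // 625 of 625 lines
        System.out.println("blocked:     " + sharedLines(10_000, 4, false)); // 3 of 625 lines
    }
}
```

With 10,000 ints and 4 threads, every one of the 625 lines is shared in the interleaved layout, while the blocked layout shares only the 3 lines that straddle chunk boundaries. Choosing a size whose chunks are a multiple of 16 ints (for example 10,240 elements) brings that to 0 in this model.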
Since you haven't shown your code, I can't really speak in any more detail here.
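The bullet about overly frequent barriers can be illustrated with a variant of the question's loop that waits at the barrier only every `stepsPerSync` steps. This is a hypothetical sketch (`BatchedBarrier`, `run`, and `stepsPerSync` are illustrative names), and batching like this is only valid if the real algorithm tolerates threads drifting apart by that many steps; if it does not, the alternative is to give each step more work.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchedBarrier {

    // Each of `threads` workers performs `totalSteps` steps but only waits at the
    // barrier after every `stepsPerSync` steps (and after the final step).
    // Returns how many times the barrier tripped.
    // Assumption: the algorithm tolerates workers drifting up to
    // `stepsPerSync` steps apart between synchronization points.
    public static int run(int threads, int totalSteps, int stepsPerSync) {
        if (stepsPerSync < 1) {
            throw new IllegalArgumentException("stepsPerSync must be >= 1");
        }
        AtomicInteger trips = new AtomicInteger();
        // The barrier action runs once per trip, in the last thread to arrive.
        CyclicBarrier barrier = new CyclicBarrier(threads, trips::incrementAndGet);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    for (int step = 1; step <= totalSteps; step++) {
                        // ... one step of work on this thread's own slice of data ...
                        if (step % stepsPerSync == 0 || step == totalSteps) {
                            barrier.await();
                        }
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining workers", e);
            }
        }
        return trips.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 5_000, 1));  // 5000 barrier trips
        System.out.println(run(4, 5_000, 50)); // 100 barrier trips
    }
}
```

Every worker calls `await()` the same number of times, so the barrier cannot deadlock; the trip count shows how many synchronization points the batching removes.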