Why use SIMD if we have GPGPUs?

Problem description

Now that we have GPGPUs with languages like CUDA and OpenCL, do the multimedia SIMD extensions (SSE/AVX/NEON) still serve a purpose?

I read an article recently about how SSE instructions can be used to accelerate sorting networks. I thought this was pretty neat, but when I told my comp arch professor, he laughed and said that running similar code on a GPU would destroy the SIMD version. I don't doubt this, because SSE is very simple and GPUs are large, highly complex accelerators with far more parallelism, but it got me thinking: are there many scenarios where the multimedia SIMD extensions are more useful than a GPU?

If GPGPUs make SIMD redundant, why is Intel increasing its SIMD support? SSE was 128 bits; now it is 256 bits with AVX, and next year it will be 512 bits. If GPGPUs are better at processing data-parallel code, why is Intel pushing these SIMD extensions? It could instead put the equivalent resources (research effort and die area) into a larger cache and a better branch predictor, improving serial performance.

Why use SIMD instead of GPGPUs?

Recommended answer

Absolutely SIMD is still relevant.

First, SIMD can more easily interoperate with scalar code, because it can read and write the same memory directly, while GPUs require the data to be uploaded to GPU memory before it can be accessed. For example, it's straightforward to vectorize a function like memcmp() via SIMD, but it would be absurd to implement memcmp() by uploading the data to the GPU and running it there. The latency would be crushing.
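As a minimal sketch of that point, here is what a SIMD buffer-equality check might look like in C, assuming an x86 target with SSE2 (the helper name `simd_bufeq` is made up for illustration). Note that both operands stay in ordinary CPU memory the whole time; there is no transfer step.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: returns 1 if buffers a and b are equal over n bytes,
 * 0 otherwise. Compares 16 bytes per iteration; a scalar loop handles the
 * remaining tail. */
static int simd_bufeq(const unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i eq = _mm_cmpeq_epi8(va, vb); /* 0xFF in lanes where bytes match */
        if (_mm_movemask_epi8(eq) != 0xFFFF) /* any lane not matching? */
            return 0;
    }
    for (; i < n; i++) /* scalar tail: remaining 0..15 bytes */
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

The scalar tail and the vector loop read the very same pointers, which is exactly the kind of interoperation with scalar code that a GPU implementation cannot match.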

Second, both SIMD and GPUs are bad at highly branchy code, but SIMD is somewhat less bad. This is because GPUs group multiple threads (a "warp") under a single instruction dispatcher. So what happens when threads need to take different paths: the if branch is taken in one thread, and the else branch in another? This is called branch divergence, and it is slow: all the "if" threads execute while the "else" threads wait, and then the "else" threads execute while the "if" threads wait. CPU cores, of course, do not have this limitation.
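On the CPU SIMD side, a data-dependent branch is typically handled by evaluating both sides and blending them through a mask, so the cost is bounded and predictable. A minimal SSE2 sketch in C (the function name `simd_max4` is hypothetical):

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Illustrative example: out[i] = (a[i] > b[i]) ? a[i] : b[i] for 4 ints,
 * with no branch. The compare yields an all-ones/all-zeros mask per lane,
 * and both "sides" of the conditional are combined through it -- loosely
 * the CPU analogue of a GPU warp executing both paths under divergence. */
static void simd_max4(const int a[4], const int b[4], int out[4])
{
    __m128i va   = _mm_loadu_si128((const __m128i *)a);
    __m128i vb   = _mm_loadu_si128((const __m128i *)b);
    __m128i mask = _mm_cmpgt_epi32(va, vb);            /* lanes where a > b */
    __m128i res  = _mm_or_si128(_mm_and_si128(mask, va),   /* take a where a > b */
                                _mm_andnot_si128(mask, vb)); /* take b elsewhere */
    _mm_storeu_si128((__m128i *)out, res);
}
```

The difference is that here only four lanes pay the both-paths cost, and the surrounding scalar code can still branch freely; on a GPU, an entire warp serializes on every divergent branch.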

The upshot is that SIMD is better for what might be called "intermediate workloads:" workloads up to intermediate size, with some data-parallelism, some unpredictability in access patterns, some branchiness. GPUs are better for very large workloads that have predictable execution flow and access patterns.

(There are also peripheral reasons, such as better support for double-precision floating point on CPUs.)
