Question
What are the key practical differences between GPGPU and regular multicore/multithreaded CPU programming, from the programmer's perspective? Specifically:
- What types of problems are better suited to regular multicore and what types are better suited to GPGPU?
- What are the key differences in programming model?
- What are the key underlying hardware differences that necessitate any differences in programming model?
- Which one is typically easier to use and by how much?
- Is it practical, in the long term, to implement high level parallelism libraries for the GPU, such as Microsoft's task parallel library or D's std.parallelism?
- If GPU computing is so spectacularly efficient, why aren't CPUs designed more like GPUs?
Answer
Interesting question. I have researched this very problem so my answer is based on some references and personal experiences.
What types of problems are better suited to regular multicore and what types are better suited to GPGPU?
Like @Jared mentioned, GPGPUs are built for very regular throughput workloads, e.g., graphics, dense matrix-matrix multiply, simple Photoshop filters, etc. They are good at tolerating long latencies because they are inherently designed to tolerate texture sampling, a 1000+ cycle operation. GPU cores have a lot of threads: when one thread fires off a long-latency operation (say, a memory access), that thread is put to sleep (and other threads continue to work) until the long-latency operation finishes. This allows GPUs to keep their execution units busy a lot more than traditional cores.
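As a rough sketch of the kind of regular, data-parallel kernel being described (the kernel and names here are made up for illustration, not code from the answer): a naive dense matrix-matrix multiply, where every thread runs the same code on a different output element, and the scheduler hides memory latency by switching between resident warps.

```cuda
// Minimal sketch: naive dense matrix-matrix multiply on the GPU.
// Every thread computes one element of C; when a warp stalls on a
// global-memory load, another resident warp runs, which is how the
// GPU keeps its execution units busy despite long memory latencies.
__global__ void matmul_naive(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];  // long-latency loads, overlapped across warps
        C[row * n + col] = acc;
    }
}

// Typical launch: one thread per output element, far more threads than cores.
// dim3 block(16, 16);
// dim3 grid((n + 15) / 16, (n + 15) / 16);
// matmul_naive<<<grid, block>>>(dA, dB, dC, n);
```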
GPUs are bad at handling branches because GPUs like to batch "threads" (SIMD lanes if you are not nVidia) into warps and send them down the pipeline together to save on instruction fetch/decode power. If threads encounter a branch, they may diverge, e.g., 2 threads in an 8-thread warp may take the branch while the other 6 may not. Now the warp has to be split into two warps of sizes 2 and 6. If your core has 8 SIMD lanes (which is why the original warp packed 8 threads), the two newly formed warps will run inefficiently: the 2-thread warp will run at 25% efficiency and the 6-thread warp at 75% efficiency. You can imagine that if a GPU keeps encountering nested branches, its efficiency becomes very low. Therefore, GPUs aren't good at handling branches, and hence branchy code should not be run on GPUs.
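A minimal sketch of what such divergence looks like in a CUDA kernel (an assumed example, not from the original answer; the device functions are stand-ins for "expensive" and "cheap" work):

```cuda
// Warp divergence sketch: all 32 lanes of a warp execute in lockstep, so when
// the data-dependent branch below splits them, the hardware runs the two paths
// one after the other with part of the warp masked off each time.
__device__ float slow_path(float x) { return sinf(x) * cosf(x); }  // stand-in "expensive" work
__device__ float fast_path(float x) { return x * 0.5f; }           // stand-in "cheap" work

__global__ void divergent_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f)               // lanes within one warp may disagree here
        out[i] = slow_path(in[i]);  // these lanes run while the others idle...
    else
        out[i] = fast_path(in[i]);  // ...then these run while the first group idles
}
```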
GPUs are also bad at cooperative threading. If threads need to talk to each other, GPUs won't work well, because synchronization is not well supported on GPUs (though nVidia is working on it).
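To make that concrete, here is a sketch (an assumed example, assuming a launch with 256 threads per block) of the limited synchronization CUDA offers: `__syncthreads()` is a barrier only for the threads of one block, and there is no cheap barrier across the whole grid, so cross-block cooperation typically means atomics or splitting the work into separate kernel launches.

```cuda
// Per-block reduction: threads within one block cooperate through shared
// memory and __syncthreads(), but combining the per-block partial sums
// requires another kernel launch (or the CPU) - there is no grid-wide barrier.
__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float buf[256];                 // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                           // barrier for THIS block only

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = buf[0]; // one partial sum per block
}
```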
Therefore, the worst code for a GPU is code with little parallelism, or code with lots of branches or synchronization.
What are the key differences in programming model?
GPUs don't support interrupts and exceptions. To me, that's the biggest difference. Other than that, CUDA is not very different from C. You can write a CUDA program where you ship code to the GPU and run it there. You access memory in CUDA a bit differently, but again, that's not fundamental to our discussion.
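A minimal sketch of that "ship code to the GPU and run it there" workflow (names like `scale_kernel` are illustrative, not from the original answer): host and device have separate memories, so data is copied explicitly before and after the kernel launch.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Kernel that runs on the GPU: scale every element of an array.
__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);                    // host (CPU) memory
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, bytes);                       // device (GPU) memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);      // copy input to the GPU

    scale_kernel<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   // launch on the GPU

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);      // copy results back
    printf("h[0] = %f\n", h[0]);                          // prints 2.0

    cudaFree(d);
    free(h);
    return 0;
}
```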
What are the key underlying hardware differences that necessitate any differences in programming model?
I mentioned them already. The biggest is the SIMD nature of GPUs, which requires code to be written in a very regular fashion with no branches and no inter-thread communication. This is part of why, e.g., CUDA restricts the number of nested branches in the code.
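One common way to keep kernel code in that "regular fashion" is to replace a data-dependent branch with straight-line arithmetic so every lane in a warp executes the same instructions. A sketch under assumed names (not from the original answer):

```cuda
// Branchy form: lanes of a warp may take different paths here.
__global__ void clamp_branchy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] < 0.0f) out[i] = 0.0f;
    else              out[i] = in[i];
}

// Branch-free form: every lane runs the exact same instruction stream.
__global__ void clamp_branchless(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = fmaxf(in[i], 0.0f);
}
```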
Which one is typically easier to use and by how much?
It depends on what you are coding and what your target is.
Easily vectorizable code: the CPU is easier to code for but gives lower performance; the GPU is somewhat harder to code for but provides a big bang for the buck. For everything else, the CPU is easier and often gives better performance as well.
Is it practical, in the long term, to implement high level parallelism libraries for the GPU, such as Microsoft's task parallel library or D's std.parallelism?
Task-parallelism, by definition, requires thread communication and has branches as well. The idea of tasks is that different threads do different things. GPUs are designed for lots of threads that are doing identical things. I would not build task parallelism libraries for GPUs.
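For contrast, here is a host-side sketch of what task parallelism means in practice (an assumed example in plain C++, as it would appear in the host portion of a CUDA program): two threads run different code and synchronize explicitly, which fits CPU threads but not the identical-work, lockstep model a GPU warp is built around.

```cuda
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>
#include <cstdio>

int main()
{
    std::vector<int> a = {5, 3, 9, 1, 7};
    std::vector<int> b = {2, 4, 6, 8, 10};
    long long total = 0;

    // Two tasks doing *different* things concurrently on CPU threads.
    std::thread sorter([&] { std::sort(a.begin(), a.end()); });
    std::thread summer([&] { total = std::accumulate(b.begin(), b.end(), 0LL); });

    sorter.join();   // explicit synchronization with each task
    summer.join();

    printf("a[0] = %d, total = %lld\n", a[0], total);
    return 0;
}
```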
If GPU computing is so spectacularly efficient, why aren't CPUs designed more like GPUs?
Lots of problems in the world are branchy and irregular; there are thousands of examples: graph search algorithms, operating systems, web browsers, etc. Just to add: even graphics is becoming more and more branchy and general-purpose with every generation, so GPUs will become more and more like CPUs. I am not saying they will become just like CPUs, but they will become more programmable. The right model is somewhere in between the power-inefficient CPUs and the very specialized GPUs.