


I am really interested in understanding how the GPU parallelizes different tasks such as real-time rendering and training neural networks. I know the math behind parallelization, but I am curious how the GPU actually works. Real-time rendering and training neural networks are very different. How does the GPU parallelize these two tasks efficiently?


GPU parallelization requires the problem to be split into as many independent, identical computations as possible (SIMD: single instruction, multiple data). What looks like this in C++:

void example(float* data, const int N) {
    for(int n=0; n<N; n++) { // one sequential loop over all N elements
        data[n] += 1.0f;
    }
}

looks like this in OpenCL C:

kernel void example(global float* data) {
    const int n = get_global_id(0); // each work-item handles exactly one element
    data[n] += 1.0f;
}
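
To make the relation between the loop and the kernel explicit, here is a minimal CPU-side sketch in C++. The names example_work_item and launch are hypothetical stand-ins, not part of OpenCL: the first plays the role of the kernel body, the second plays the role of the runtime that hands every work-item its own index n. On a real GPU these calls would run in parallel and in no guaranteed order; here they simply run one after another.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the kernel body: it handles exactly one index n
// and never touches any other element, so all calls are independent.
void example_work_item(float* data, std::size_t n) {
    data[n] += 1.0f;
}

// Hypothetical stand-in for the kernel launch. A GPU runtime would distribute
// these calls over its cores; this sketch just runs them sequentially.
void launch(std::vector<float>& data) {
    for (std::size_t n = 0; n < data.size(); n++) {
        example_work_item(data.data(), n);
    }
}
```

Because no call reads or writes an element that belongs to another call, the order in which they run does not change the result, which is what allows the GPU to run them concurrently.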


For real-time rendering, a tessellated surface can be rendered by the GPU by drawing every triangle using a separate GPU core. https://youtu.be/1ww8qRCMc4s


Neural networks come down to large matrix multiplications, and within a matrix, individual columns or tiles can be computed independently and in parallel. Vector additions, for example, are parallelized over as many vector components as there are, and each GPU core computes only a single vector component.
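
As a rough sketch of why matrix multiplication parallelizes so well, the C++ below computes every output element with its own independent function call. The function names and the row-major storage layout are assumptions made for illustration; real GPU libraries use tiling and other optimizations that are not shown here.

```cpp
#include <cstddef>
#include <vector>

// Computes one element C[row][col] of C = A * B, with A of size M x K and
// B of size K x N, both stored row-major. On a GPU, one thread would run
// this for its own (row, col) pair.
float matmul_element(const std::vector<float>& A, const std::vector<float>& B,
                     std::size_t K, std::size_t N,
                     std::size_t row, std::size_t col) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < K; k++) {
        sum += A[row * K + k] * B[k * N + col];
    }
    return sum;
}

// CPU stand-in for the launch: every iteration is independent of all others,
// so a GPU could assign each linear index id to a separate thread.
std::vector<float> matmul(const std::vector<float>& A, const std::vector<float>& B,
                          std::size_t M, std::size_t K, std::size_t N) {
    std::vector<float> C(M * N);
    for (std::size_t id = 0; id < M * N; id++) {
        C[id] = matmul_element(A, B, K, N, id / N, id % N);
    }
    return C;
}
```

Each output element only reads from A and B and writes to its own slot in C, so there is nothing to synchronize between threads.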

Lattice-based fluid simulations such as LBM work on a 3D lattice of, say, 256x256x256 lattice points. For each of these 16777216 lattice points the computations are the same, and they can be done concurrently because they are independent of each other. So the simulation is split into 16777216 threads on the GPU, one for every lattice point. If the GPU has 4096 cores, it can compute 4096 of these concurrently. As you can imagine, this can be orders of magnitude faster than running such tasks on a CPU. https://youtu.be/a1u2g9ahIDk
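
The numbers above can be checked with a small C++ sketch. The x-fastest index layout and all names are assumptions for illustration, and the rounds function is an idealized count that ignores memory latency and scheduling details.

```cpp
#include <cstdint>

constexpr std::uint32_t Nx = 256, Ny = 256, Nz = 256;
constexpr std::uint32_t lattice_points = Nx * Ny * Nz; // 16777216 threads in total

struct Coordinates { std::uint32_t x, y, z; };

// Maps a linear thread index n to the lattice point that thread works on
// (assumed layout: x changes fastest, then y, then z).
Coordinates coordinates(std::uint32_t n) {
    return { n % Nx, (n / Nx) % Ny, n / (Nx * Ny) };
}

// Inverse mapping: lattice point back to its linear thread index.
std::uint32_t linear_index(Coordinates c) {
    return c.x + Nx * (c.y + Ny * c.z);
}

// Idealized number of passes needed when only `cores` threads run at once.
constexpr std::uint32_t rounds(std::uint32_t threads, std::uint32_t cores) {
    return (threads + cores - 1) / cores; // integer division, rounded up
}
```

With 4096 cores, rounds(lattice_points, 4096) gives 4096 passes over the lattice, compared to 16777216 sequential iterations on a single CPU core.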

A particle simulation can compute each particle on a separate GPU core. This works as long as the particles are mostly independent. https://youtu.be/8Szib8Km5Mo

For good saturation, that is, to reach maximum efficiency, the number of threads should be much larger than the number of GPU cores available. Branching also takes a performance hit: in groups of 32 GPU cores, if one is in the true branch and all the others are in the false branch, both branches have to be computed by all cores within the group. In the tessellated surface rendering example, if the triangles have vastly different sizes, performance takes a hit for a similar reason: the entire group has to wait for the one GPU core with the largest triangle to finish. If all triangles are approximately the same size, however, performance is very good.
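
Both effects can be put into a toy cost model in C++. This is an idealized sketch, not a measurement of real hardware: the function names and the cost units are made up for illustration, and the only thing taken from the text above is that 32 cores execute in lockstep.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr std::size_t group_size = 32; // cores that execute in lockstep

// Branching: the whole group pays for a branch as soon as at least one of
// its cores takes it, so a split group pays for both branches.
int group_branch_cost(const std::array<bool, group_size>& takes_true_branch,
                      int cost_true, int cost_false) {
    bool any_true = false, any_false = false;
    for (bool t : takes_true_branch) {
        if (t) any_true = true; else any_false = true;
    }
    return (any_true ? cost_true : 0) + (any_false ? cost_false : 0);
}

// Unequal workloads (e.g. triangles of different sizes): the group is only
// finished once its slowest core is finished.
int group_time(const std::array<int, group_size>& work_per_core) {
    return *std::max_element(work_per_core.begin(), work_per_core.end());
}
```

In this model a single core taking the true branch raises the cost of the group from cost_false to cost_true + cost_false, and a single large triangle sets the time for all 32 cores.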


08-19 23:35