Problem Description
I'm trying to use Numba and Dask to speed up a slow computation, similar to calculating the kernel density estimate of a huge collection of points. My plan was to write the computationally expensive logic in a jit'ed function and then split the work among the CPU cores using Dask. I wanted to use the nogil feature of numba.jit so that I could use the Dask threading backend and avoid unnecessary memory copies of the input data (which is very large).
Unfortunately, Dask doesn't result in a speedup unless I use the 'processes' scheduler. If I use a ThreadPoolExecutor instead, then I see the expected speedup.
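For reference, the ThreadPoolExecutor version looks roughly like this (a minimal sketch, using the jit_render_internal and args defined in the full example below):

from concurrent.futures import ThreadPoolExecutor

# One call per core; with the GIL released these should run in parallel.
with ThreadPoolExecutor(max_workers=CPU_COUNT) as executor:
    futures = [executor.submit(jit_render_internal, *args) for i in range(CPU_COUNT)]
    results = [f.result() for f in futures]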
Here's a simplified example of my problem:
import os
import numpy as np
import numba
import dask

CPU_COUNT = os.cpu_count()

def render_internal(size, mag):
    """mag is the magnification to apply
    generate coordinates internally
    """
    coords = np.random.rand(size, 2)
    img = np.zeros((mag, mag), dtype=np.int64)
    for i in range(len(coords)):
        y0, x0 = coords[i] * mag
        y1, x1 = int(y0), int(x0)
        m = 1
        img[y1, x1] += m

jit_render_internal = numba.jit(render_internal, nogil=True, nopython=True)

args = 10000000, 100

print("Linear time:")
%time linear_compute = [jit_render_internal(*args) for i in range(CPU_COUNT)]

delayed_jit_render_internal = dask.delayed(jit_render_internal)

print()
print("Threads time:")
%time dask_compute_threads = dask.compute(*[delayed_jit_render_internal(*args) for i in range(CPU_COUNT)])

print()
print("Processes time:")
%time dask_compute_processes = dask.compute(*[delayed_jit_render_internal(*args) for i in range(CPU_COUNT)], scheduler="processes")
Here's the output on my machine:
Linear time:
Wall time: 1min 17s
Threads time:
Wall time: 1min 47s
Processes time:
Wall time: 7.79 s
For both the processes and threads backends I see full utilization of all CPU cores, as expected, but no speedup for the threads backend. I'm pretty sure that the jitted function, jit_render_internal, is not, in fact, releasing the GIL.
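One way to sanity-check this (a sketch I'd use, not definitive): time a single call, then time two concurrent calls from plain threads. If the GIL is really released, the two-thread wall time should stay close to the single-call time instead of roughly doubling.

import threading
import time

jit_render_internal(*args)  # warm-up call so JIT compilation isn't timed

t0 = time.perf_counter()
jit_render_internal(*args)
serial = time.perf_counter() - t0

threads = [threading.Thread(target=jit_render_internal, args=args) for i in range(2)]
t0 = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
parallel = time.perf_counter() - t0

# parallel close to serial means the GIL was released;
# parallel close to 2 * serial means the threads were serialized.
print(serial, parallel)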
My two questions are:
- If the nogil keyword is passed to numba.jit and the GIL cannot be released, why isn't an error raised?
- Why doesn't the code, as I've written it, release the GIL? All the computation is embedded in the function and there's no return value.
Recommended Answer
Try the following, which is much faster and seems to fix the thread performance issue:
def render_internal(size, mag):
    """mag is the magnification to apply
    generate coordinates internally
    """
    coords = np.random.rand(size, 2)
    img = np.zeros((mag, mag), dtype=np.int64)
    for i in range(len(coords)):
        # y0, x0 = coords[i] * mag
        y0 = coords[i, 0] * mag
        x0 = coords[i, 1] * mag
        y1, x1 = int(y0), int(x0)
        m = 1
        img[y1, x1] += m
I've split the calculation of x0 and y0 up in the above. On my machine, the threads-based solution is actually faster than the processes-based one after the change. The likely culprit is the tuple unpacking y0, x0 = coords[i] * mag, which allocates a small temporary array on every loop iteration; splitting it into two scalar expressions avoids that per-iteration allocation.
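To verify the fix, you can re-jit the revised function and re-run the same benchmark from the question (a sketch reusing the names defined above):

jit_render_internal = numba.jit(render_internal, nogil=True, nopython=True)
delayed_jit_render_internal = dask.delayed(jit_render_internal)

print("Threads time, after the fix:")
%time dask.compute(*[delayed_jit_render_internal(*args) for i in range(CPU_COUNT)])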