Problem Description
I'm trying to use Numba and Dask to speed up a slow computation, similar to calculating the kernel density estimate of a huge collection of points. My plan was to write the computationally expensive logic in a jit'ed function and then split the work among the CPU cores using Dask. I wanted to use the nogil feature of numba.jit so that I could use the Dask threading backend and avoid unnecessary memory copies of the input data (which is very large).
Unfortunately, Dask doesn't result in a speedup unless I use the 'processes' scheduler. If I use a ThreadPoolExecutor instead, then I see the expected speedup.
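For reference, the ThreadPoolExecutor version looks roughly like this (a minimal sketch, using the jit_render_internal and args defined in the full example below):

from concurrent.futures import ThreadPoolExecutor

# One call per core; with the GIL released these should run in parallel.
with ThreadPoolExecutor(max_workers=CPU_COUNT) as executor:
    futures = [executor.submit(jit_render_internal, *args) for i in range(CPU_COUNT)]
    results = [f.result() for f in futures]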
Here's a simplified example of my problem:
import os
import numpy as np
import numba
import dask

CPU_COUNT = os.cpu_count()

def render_internal(size, mag):
    """mag is the magnification to apply
    generate coordinates internally
    """
    coords = np.random.rand(size, 2)
    img = np.zeros((mag, mag), dtype=np.int64)
    for i in range(len(coords)):
        y0, x0 = coords[i] * mag
        y1, x1 = int(y0), int(x0)
        m = 1
        img[y1, x1] += m

jit_render_internal = numba.jit(render_internal, nogil=True, nopython=True)

args = 10000000, 100

print("Linear time:")
%time linear_compute = [jit_render_internal(*args) for i in range(CPU_COUNT)]

delayed_jit_render_internal = dask.delayed(jit_render_internal)

print()
print("Threads time:")
%time dask_compute_threads = dask.compute(*[delayed_jit_render_internal(*args) for i in range(CPU_COUNT)])

print()
print("Processes time:")
%time dask_compute_processes = dask.compute(*[delayed_jit_render_internal(*args) for i in range(CPU_COUNT)], scheduler="processes")
Here's the output on my machine:
Linear time:
Wall time: 1min 17s
Threads time:
Wall time: 1min 47s
Processes time:
Wall time: 7.79 s
For both the processes and threads backends I see full utilization of all CPU cores, as expected, but no speedup for the threads backend. I'm pretty sure that the jitted function, jit_render_internal, is not, in fact, releasing the GIL.
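One way to sanity-check this (a sketch I'd use, not definitive): time a single call, then time two concurrent calls from plain threads. If the GIL is really released, the two-thread wall time should stay close to the single-call time instead of roughly doubling.

import threading
import time

jit_render_internal(*args)  # warm-up call so JIT compilation isn't timed

t0 = time.perf_counter()
jit_render_internal(*args)
serial = time.perf_counter() - t0

threads = [threading.Thread(target=jit_render_internal, args=args) for i in range(2)]
t0 = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
parallel = time.perf_counter() - t0

# parallel close to serial means the GIL was released;
# parallel close to 2 * serial means the threads were serialized.
print(serial, parallel)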
My two questions are:
- If the nogil keyword is passed to numba.jit and the GIL cannot be released, why isn't an error raised?
- Why doesn't the code, as I've written it, release the GIL? All the computation is embedded in the function and there's no return value.
Recommended Answer
Try the following, which is much faster and seems to fix the thread performance issue:
def render_internal(size, mag):
    """mag is the magnification to apply
    generate coordinates internally
    """
    coords = np.random.rand(size, 2)
    img = np.zeros((mag, mag), dtype=np.int64)
    for i in range(len(coords)):
        # y0, x0 = coords[i] * mag
        y0 = coords[i, 0] * mag
        x0 = coords[i, 1] * mag
        y1, x1 = int(y0), int(x0)
        m = 1
        img[y1, x1] += m
I've split the calculation of x0 and y0 up in the above. On my machine, the threads-based solution is actually faster than the processes-based one after the change. The likely culprit is the tuple unpacking y0, x0 = coords[i] * mag, which allocates a small temporary array on every loop iteration; splitting it into two scalar expressions avoids that per-iteration allocation.
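To verify the fix, you can re-jit the revised function and re-run the same benchmark from the question (a sketch reusing the names defined above):

jit_render_internal = numba.jit(render_internal, nogil=True, nopython=True)
delayed_jit_render_internal = dask.delayed(jit_render_internal)

print("Threads time, after the fix:")
%time dask.compute(*[delayed_jit_render_internal(*args) for i in range(CPU_COUNT)])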