numpy is faster than numba and cython: how to improve the numba code

Question

I have a simple example here to help me understand how to use numba and cython. I am new to both. I've tried my best to incorporate all the tricks to make numba fast and, to some extent, the same for cython, but my numpy code is almost 2x faster than numba (for float64), and more than 2x faster if using float32. I'm not sure what I am missing here.

I was thinking that perhaps the problem isn't the code anymore but rather the compiler and related matters, which I'm not very familiar with.

I've gone through a lot of Stack Overflow posts about numpy, numba and cython and found no straight answers.

numpy version:

import numpy as np

def py_expsum(x):
    return np.sum(np.exp(x))

numba version:

import numba
import numpy as np

@numba.jit(nopython=True)
def nb_expsum(x):
    nx, ny = x.shape
    val = 0.0
    for ix in range(nx):
        for iy in range(ny):
            val += np.exp(x[ix, iy])
    return val

Cython version:

import numpy as np
import cython
from libc.math cimport exp

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef double cy_expsum2 ( double[:,:] x, int nx, int ny ):
    cdef:
        double val = 0.0
        int ix, iy
    for ix in range(nx):
        for iy in range(ny):
            val += exp(x[ix, iy])
    return val

I tested with an array of size 2000 x 1000 and looped over it 100 times. For numba, the first call (the one that triggers compilation) is not counted in the loop.
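The measurement procedure described above (one untimed warm-up call, then a timed loop of 100 calls) could be sketched as follows. This is a minimal sketch, not the code actually used for the numbers in this question; time_calls and pure_expsum are hypothetical names, and the pure-Python workload is only a stand-in so the snippet runs without numba or cython.

```python
import math
import time

def time_calls(func, arg, repeats=100, warmup=1):
    """Return total wall-clock seconds for `repeats` calls of func(arg).

    The warm-up calls run first and are not timed, so one-off costs such
    as JIT compilation do not distort the measurement.
    """
    for _ in range(warmup):
        func(arg)
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    return time.perf_counter() - start

def pure_expsum(values):
    # Stand-in workload (hypothetical): plain-Python sum of exponentials.
    return sum(math.exp(v) for v in values)

data = [i / 1000.0 for i in range(1000)]
elapsed = time_calls(pure_expsum, data, repeats=100)
print(f"100 calls took {elapsed:.4f} s")
```

The same helper could be called with any of the functions in this post in place of pure_expsum, for example time_calls(py_expsum, x) with a 2000 x 1000 array x.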

Using Python 3 (Anaconda distribution) on Windows 10.

               float64       /   float32
    1. numpy : 0.56 sec      /   0.23 sec
    2. numba : 0.93 sec      /   0.74 sec
    3. cython: 0.83 sec

cython is close to numba. So the big question for me is: why can't numba beat numpy's runtime? What did I do wrong, or what am I missing? How can other factors contribute, and how do I find out?

Recommended answer

As we will see, the behavior depends on which numpy distribution is used.

This answer focuses on the Anaconda distribution with Intel's VML (Vector Math Library); mileage may vary with other hardware and numpy versions.

It will also be shown how VML can be used via Cython or numexpr, in case one doesn't use the Anaconda distribution, which plugs in VML under the hood for some numpy operations.

I can reproduce your results for the following dimensions:

N,M=2*10**4, 10**3
a=np.random.rand(N, M)

I get:

%timeit py_expsum(a)  #   87ms
%timeit nb_expsum(a)  #  672ms
%timeit nb_expsum2(a)  #  412ms
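For reference, the slowdown factors relative to numpy implied by these three timings can be computed directly. This is only arithmetic on the numbers quoted above; the dictionary and variable names are illustrative.

```python
# Timings in milliseconds, copied from the %timeit results above.
timings_ms = {"py_expsum": 87.0, "nb_expsum": 672.0, "nb_expsum2": 412.0}

baseline = timings_ms["py_expsum"]
slowdowns = {name: t / baseline for name, t in timings_ms.items()}

for name, factor in slowdowns.items():
    print(f"{name}: {factor:.1f}x the numpy time")
```

So the serial numba loop is a bit under 8 times slower than numpy here, and the parallel one still almost 5 times slower.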

The lion's share (about 90%) of the calculation time is spent evaluating the exp function and, as we will see, it is a CPU-intensive task.
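One rough way to check such a split on your own machine is to time the exponential and the reduction separately. The following is a minimal sketch under stated assumptions: best_time is a hypothetical helper, the array size is chosen arbitrarily, and the resulting share depends on hardware and on the numpy build, so it will not necessarily match the 90% quoted above.

```python
import time
import numpy as np

def best_time(func, *args, repeats=5):
    """Return the fastest of `repeats` wall-clock timings of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

a = np.random.rand(2000, 1000)
exp_a = np.exp(a)

t_exp = best_time(np.exp, a)      # exponential only (includes allocating the result)
t_sum = best_time(np.sum, exp_a)  # reduction only
share = t_exp / (t_exp + t_sum)
print(f"exp accounts for roughly {share:.0%} of exp + sum")
```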

A quick glance at the top statistics shows that numpy's version is executed in parallel, but this is not the case for numba. However, on my VM with only two processors, parallelization alone cannot explain the huge difference of a factor of 7 (as shown by DavidW's version, nb_expsum2).

Profiling the code via perf for both versions shows the following:

nb_expsum

Overhead  Command  Shared Object                                      Symbol
  62,56%  python   libm-2.23.so                                       [.] __ieee754_exp_avx
  16,16%  python   libm-2.23.so                                       [.] __GI___exp
   5,25%  python   perf-28936.map                                     [.] 0x00007f1658d53213
   2,21%  python   mtrand.cpython-37m-x86_64-linux-gnu.so             [.] rk_random

py_expsum

  31,84%  python   libmkl_vml_avx.so                                  [.] mkl_vml_kernel_dExp_E9HAynn
   9,47%  python   libiomp5.so                                        [.] _INTERNAL_25_______src_kmp_barrier_cpp_38a91946::__kmp_wait_te
   6,21%  python   [unknown]                                          [k] 0xffffffff8140290c
   5,27%  python   mtrand.cpython-37m-x86_64-linux-gnu.so             [.] rk_random

As one can see, numpy uses Intel's parallelized, vectorized MKL/VML version under the hood, which easily outperforms the version from the GNU math library (libm.so) used by numba (or by the parallel version of numba, or by cython for that matter). One could level the ground a little by using parallelization, but MKL's vectorized version would still outperform numba and cython.
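A quick, rough way to see whether a given numpy build mentions MKL at all is to search the text printed by np.show_config(). This is only a heuristic sketch: numpy_config_mentions_mkl is a hypothetical helper name, the output format of show_config differs between numpy versions, and being linked against MKL for linear algebra does not by itself prove that VML is used for exp.

```python
import contextlib
import io

import numpy as np

def numpy_config_mentions_mkl():
    """Return True if the text printed by np.show_config() contains 'mkl'."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return "mkl" in buffer.getvalue().lower()

print("numpy build mentions MKL:", numpy_config_mentions_mkl())
```

Profiling with perf, as done above, remains the more reliable way to see which exp implementation is actually executed.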

However, seeing performance for only one size isn't very enlightening, and in the case of exp (as for other transcendental functions) there are two dimensions to consider:

  • The number of elements in the array: cache effects and different algorithms for different sizes (not unheard of in numpy) can lead to different performance.
  • Depending on the x-value, different amounts of time are needed to calculate exp(x). Normally there are three different types of input leading to different calculation times: very small, normal, and very big (with non-finite results).
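To make the three input regimes concrete, the sketch below scales uniform random numbers by the factors used for the benchmarks in this answer (0.0, 1.0 and 1e4) and counts how many results are non-finite. It only illustrates which inputs overflow, not how long each case takes; the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.random(8)  # uniform samples in [0, 1)

nonfinite = {}
for factor in (0.0, 1.0, 1e4):
    x = factor * base
    # exp overflows float64 for x above roughly 709.78; silence the warning.
    with np.errstate(over="ignore"):
        y = np.exp(x)
    nonfinite[factor] = int(np.count_nonzero(~np.isfinite(y)))
    print(f"factor={factor:g}: {nonfinite[factor]} of {y.size} results are non-finite")
```

With factor 0.0 every input is exactly 0 and every result is 1; with factor 1e4 most inputs exceed the overflow threshold and the results become inf.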

I'm using perfplot to visualize the results (see code in the appendix). For the "normal" range we get the following performance:

And while the performance for 0.0 is similar, we can see that Intel's VML takes quite a negative hit as soon as the results become infinite:

However, there are other things to observe:

  • For vector sizes <= 8192 = 2^13, numpy uses the non-parallelized glibc version of exp (the same one numba and cython use).
  • The Anaconda distribution, which I use, overrides numpy's functionality and plugs in Intel's VML library for sizes > 8192, which is vectorized and parallelized. This explains the drop in running times for sizes of about 10^4.
  • For smaller sizes, numba easily beats the usual glibc version (too much overhead for numpy), but for bigger arrays there would not be much difference (if numpy did not switch to VML).
  • It seems to be a CPU-bound task: we cannot see cache boundaries anywhere.
  • The parallelized numba version only makes sense if there are more than 500 elements.

So what are the consequences?

  1. If there are no more than 8192 elements, the numba version should be used.
  2. Otherwise use the numpy version (even if no VML plugin is available, not much is lost).
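The two rules above amount to choosing an implementation by array size. A minimal sketch of such a dispatcher is shown below; make_expsum is a hypothetical helper, and the 8192 limit is the switch-over observed on this particular setup, so it should be treated as a tunable assumption. To keep the snippet runnable without numba, the numpy function stands in for both branches.

```python
import numpy as np

def py_expsum(x):
    return np.sum(np.exp(x))

def make_expsum(small_impl, large_impl, limit=8192):
    """Build a function that picks an implementation by element count."""
    def expsum(x):
        impl = small_impl if x.size <= limit else large_impl
        return impl(x)
    return expsum

# numpy stands in for both branches here; with numba installed one would
# pass the compiled loop version as small_impl instead.
expsum = make_expsum(small_impl=py_expsum, large_impl=py_expsum)
print(expsum(np.zeros((1, 4))))  # exp(0) summed over 4 elements
```

In practice one would call make_expsum(nb_expsum, py_expsum) with the numba and numpy functions from the listings.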

NB: numba cannot automatically use vdExp from Intel's VML (as partly suggested in the comments), because it calculates exp(x) for each element individually, while VML operates on a whole array.

One could reduce the cache misses that occur when writing and loading data, as performed by the numpy version, by using the following algorithm:

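  1. Perform VML's vdExp on a part of the data that fits in the cache but is not too small either (overhead).
  2. Sum up the resulting working array.
  3. Perform 1. + 2. for the next part of the data, until all data is processed.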
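The structure of these three steps can be sketched in plain numpy. Note the assumptions: this uses np.exp with a reusable output buffer rather than calling vdExp directly, chunked_expsum and its chunk size of 1024 are illustrative choices, and the Python-level loop adds overhead of its own, so this shows the shape of the algorithm rather than a fast implementation.

```python
import numpy as np

def chunked_expsum(x, chunk=1024):
    """Sum exp(x) chunk by chunk, reusing one small work buffer."""
    flat = np.ascontiguousarray(x, dtype=np.float64).ravel()
    work = np.empty(chunk, dtype=np.float64)  # reused for every chunk
    total = 0.0
    for start in range(0, flat.size, chunk):
        block = flat[start:start + chunk]
        out = work[:block.size]    # the last chunk may be shorter
        np.exp(block, out=out)     # step 1: exp of the current chunk
        total += out.sum()         # step 2: sum the work buffer
    return total                   # step 3 is the loop itself

a = np.random.rand(200, 1000)
print(np.isclose(chunked_expsum(a), np.sum(np.exp(a))))
```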

However, I would not expect to gain more than 10% compared to numpy's version (but maybe I'm wrong), as 90% of the computation time is spent in VML anyway.

Nevertheless, here is a possible quick and dirty implementation in Cython:

%%cython -L=<path_mkl_libs> --link-args=-Wl,-rpath=<path_mkl_libs> --link-args=-Wl,--no-as-needed -l=mkl_intel_ilp64 -l=mkl_core -l=mkl_gnu_thread -l=iomp5
# path to mkl can be found via np.show_config()
# which libraries needed: https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

# another option would be to wrap mkl.h:
cdef extern from *:
    """
    // MKL_INT is 64bit integer for mkl-ilp64
    // see https://software.intel.com/en-us/mkl-developer-reference-c-c-datatypes-specific-to-intel-mkl
    #define MKL_INT long long int
    void  vdExp(MKL_INT n, const double *x, double *y);
    """
    void vdExp(long long int n, const double *x, double *y)

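def cy_expsum(const double[:, ::1] v):
    # v must be C-contiguous: the pointer arithmetic below walks the
    # underlying buffer linearly.
    cdef:
        double[1024] w            # work buffer, small enough to stay in cache
        Py_ssize_t n = v.size
        Py_ssize_t current = 0
        Py_ssize_t size = 0
        Py_ssize_t i = 0
        double res = 0.0
    while current < n:
        size = n - current
        if size > 1024:
            size = 1024
        vdExp(size, &v[0, 0] + current, w)   # exp of the current chunk
        for i in range(size):
            res += w[i]                       # sum up the work buffer
        current += size
    return res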

However, this is exactly what numexpr, which also uses Intel's VML as a backend, would do:

import numexpr as ne

def ne_expsum(x):
    return ne.evaluate("sum(exp(x))")

As for timings, we can see the following:

with the following noteworthy details:

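  • The numpy, numexpr and cython versions have almost the same performance for bigger arrays, which is not surprising because they use the same VML functionality.
  • Of these three, the cython version has the least overhead and numexpr the most.
  • The numexpr version is probably the easiest to write (given that not every numpy distribution plugs in VML functionality).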

Listings:

Plots:

import numpy as np
def py_expsum(x):
    return np.sum(np.exp(x))

import numba as nb
@nb.jit( nopython=True)
def nb_expsum(x):
    nx, ny = x.shape
    val = 0.0
    for ix in range(nx):
        for iy in range(ny):
            val += np.exp( x[ix, iy] )
    return val

@nb.jit( nopython=True, parallel=True)
def nb_expsum2(x):
    nx, ny = x.shape
    val = 0.0
    for ix in range(nx):
        for iy in nb.prange(ny):
            val += np.exp( x[ix, iy]   )
    return val

import perfplot
factor = 1.0 # 0.0 or 1e4
perfplot.show(
    setup=lambda n: factor*np.random.rand(1,n),
    n_range=[2**k for k in range(0,27)],
    kernels=[
        py_expsum,
        nb_expsum,
        nb_expsum2,
        ],
    logx=True,
    logy=True,
    xlabel='len(x)'
    )
