python - Why is `vectorize` outperformed by `frompyfunc`?

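NumPy offers vectorize and frompyfunc, which have similar functionality.

As pointed out in this SO post, vectorize wraps frompyfunc and handles the type of the returned array correctly, whereas frompyfunc returns an array of dtype object (np.object).

However, frompyfunc consistently outperforms vectorize by 10-20% for all sizes, and the different return types cannot explain this.

Consider the following variants: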

import numpy as np

def do_double(x):
    return 2.0*x

vectorize = np.vectorize(do_double)

frompyfunc = np.frompyfunc(do_double, 1, 1)

def wrapped_frompyfunc(arr):
    return frompyfunc(arr).astype(np.float64)

wrapped_frompyfunc just converts the result of frompyfunc to the right type. As we can see, the cost of this operation is almost negligible.
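A quick check of the result dtypes makes the difference between the three variants concrete. This is a minimal sketch that repeats the definitions above so it runs on its own; the array length of 5 is an arbitrary choice for illustration.

```python
import numpy as np

def do_double(x):
    return 2.0 * x

vectorize = np.vectorize(do_double)
frompyfunc = np.frompyfunc(do_double, 1, 1)

def wrapped_frompyfunc(arr):
    return frompyfunc(arr).astype(np.float64)

arr = np.linspace(0, 1, 5)  # small illustrative input

# vectorize picks a numeric output dtype for the result
print(vectorize(arr).dtype)
# frompyfunc always yields an object array holding Python objects
print(frompyfunc(arr).dtype)
# the explicit cast restores a numeric dtype
print(wrapped_frompyfunc(arr).dtype)
```

The first and third lines should report float64 and the second object, which is the only functional difference between the variants for this input.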

This results in the following timings (the blue line is frompyfunc):

[Plot: running time versus len(x) for frompyfunc, vectorize and wrapped_frompyfunc]

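I would expect vectorize to have more overhead, but this should be visible only for small sizes. On the other hand, the conversion from np.object to np.float64 is also done in wrapped_frompyfunc, which is still much faster.

How can this performance difference be explained?

Code to produce the timing comparison using the perfplot package (given the functions above):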

import numpy as np
import perfplot
perfplot.show(
    setup=lambda n: np.linspace(0, 1, n),
    n_range=[2**k for k in range(20,27)],
    kernels=[
        frompyfunc,
        vectorize,
        wrapped_frompyfunc,
        ],
    labels=["frompyfunc", "vectorize", "wrapped_frompyfunc"],
    logx=True,
    logy=False,
    xlabel='len(x)',
    equality_check = None,
    )

Note: for smaller sizes, the overhead of vectorize is much higher, but that is to be expected (it wraps frompyfunc after all).

Best answer

Following the hints of @hpaulj, we can profile the vectorize function:

arr=np.linspace(0,1,10**7)
%load_ext line_profiler

%lprun -f np.vectorize._vectorize_call \
       -f np.vectorize._get_ufunc_and_otypes  \
       -f np.vectorize.__call__  \
       vectorize(arr)

which shows that 100% of the time is spent in _vectorize_call:

Timer unit: 1e-06 s

Total time: 3.53012 s
File: python3.7/site-packages/numpy/lib/function_base.py
Function: __call__ at line 2063

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2063                                               def __call__(self, *args, **kwargs):
  ...
  2091         1    3530112.0 3530112.0    100.0          return self._vectorize_call(func=func, args=vargs)

...

Total time: 3.38001 s
File: python3.7/site-packages/numpy/lib/function_base.py
Function: _vectorize_call at line 2154

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2154                                               def _vectorize_call(self, func, args):
  ...
  2161         1         85.0     85.0      0.0              ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
  2162
  2163                                                       # Convert args to object arrays first
  2164         1          1.0      1.0      0.0              inputs = [array(a, copy=False, subok=True, dtype=object)
  2165         1     117686.0 117686.0      3.5                        for a in args]
  2166
  2167         1    3089595.0 3089595.0     91.4              outputs = ufunc(*inputs)
  2168
  2169         1          4.0      4.0      0.0              if ufunc.nout == 1:
  2170         1     172631.0 172631.0      5.1                  res = array(outputs, copy=False, subok=True, dtype=otypes[0])
  2171                                                       else:
  2172                                                           res = tuple([array(x, copy=False, subok=True, dtype=t)
  2173                                                                        for x, t in zip(outputs, otypes)])
  2174         1          1.0      1.0      0.0          return res

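It shows the part I missed in my assumptions: the array of doubles is converted to an object array in its entirety in a preprocessing step (which is not a very wise thing to do memory-wise). The other parts are similar for wrapped_frompyfunc: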

Timer unit: 1e-06 s

Total time: 3.20055 s
File: <ipython-input-113-66680dac59af>
Function: wrapped_frompyfunc at line 16

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    16                                           def wrapped_frompyfunc(arr):
    17         1    3014961.0 3014961.0     94.2      a = frompyfunc(arr)
    18         1     185587.0 185587.0      5.8      b = a.astype(np.float64)
    19         1          1.0      1.0      0.0      return b

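When we look at peak memory consumption (e.g. via /usr/bin/time python script.py), we see that the vectorize version uses twice as much memory as frompyfunc. frompyfunc uses a more sophisticated strategy: the array of doubles is handled in blocks of size NPY_BUFSIZE (which is 8192), so only 8192 Python floats (24 bytes plus an 8-byte pointer each) are present in memory at the same time, rather than one per array element, which might be far more. The cost of reserving the memory from the OS, together with more cache misses, is probably what leads to the higher running time.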
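The memory difference can be approximated from within Python. The following is a rough sketch, not a reproduction of the /usr/bin/time measurement: it uses the standard tracemalloc module, which reports traced allocations rather than the process's resident set size. The helper names preconvert_then_call, call_directly and peak_bytes are hypothetical, and preconvert_then_call only mimics the preprocessing step seen in the profile above.

```python
import tracemalloc
import numpy as np

def do_double(x):
    return 2.0 * x

frompyfunc = np.frompyfunc(do_double, 1, 1)

def preconvert_then_call(arr):
    # Mimics the strategy seen in _vectorize_call: materialize the whole
    # input as an object array before the ufunc runs.
    inputs = arr.astype(object)
    return frompyfunc(inputs).astype(np.float64)

def call_directly(arr):
    # Lets the ufunc machinery cast doubles to objects chunk by chunk.
    return frompyfunc(arr).astype(np.float64)

def peak_bytes(func, arr):
    # Peak of traced allocations while func(arr) runs.
    tracemalloc.start()
    try:
        func(arr)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

arr = np.linspace(0, 1, 200_000)  # allocated before tracing starts
peak_pre = peak_bytes(preconvert_then_call, arr)
peak_direct = peak_bytes(call_directly, arr)
print(f"preconverted input: {peak_pre / 1e6:.1f} MB peak")
print(f"direct call:        {peak_direct / 1e6:.1f} MB peak")
```

The exact numbers depend on the NumPy and Python versions, but the preconverted path should show a clearly higher peak because a full set of input float objects is alive alongside the output.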

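My take-aways:

The preprocessing step, which converts all inputs into object arrays, might not be needed at all, because frompyfunc has an even more sophisticated way of handling those conversions.
Neither vectorize nor frompyfunc should be used when the resulting ufunc is meant for "real code". Instead, one should either write it in C or use numba or similar.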
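For this particular do_double, the second take-away can be illustrated without writing any C, because the multiplication is already available as a compiled NumPy ufunc. This is a sketch under the assumption that the function can be expressed with built-in array operations; functions that cannot would need C or numba as noted above. The name native_double is hypothetical.

```python
import numpy as np

def do_double(x):
    return 2.0 * x

vectorize = np.vectorize(do_double)

def native_double(arr):
    # np.multiply is a compiled ufunc: the loop runs in C on raw doubles,
    # without creating a Python float object per element.
    return np.multiply(2.0, arr)

arr = np.linspace(0, 1, 1000)
same = np.array_equal(native_double(arr), vectorize(arr))
print(same)
```

Both versions compute the same values; the difference lies only in where the per-element loop runs.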

Calling frompyfunc on the object array takes less time than calling it on the array of doubles:

arr=np.linspace(0,1,10**7)
a = arr.astype(object)  # np.object was removed in NumPy 1.24; the builtin object is equivalent
%timeit frompyfunc(arr)  # 1.08 s ± 65.8 ms
%timeit frompyfunc(a)    # 876 ms ± 5.58 ms
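The %timeit magic only works inside IPython. As a sketch of how the same comparison could be run as a plain script, the standard timeit module can be used instead. The array here is smaller than the 10**7 elements above to keep the run short, so the absolute numbers will not match those quoted.

```python
import timeit
import numpy as np

def do_double(x):
    return 2.0 * x

frompyfunc = np.frompyfunc(do_double, 1, 1)

arr = np.linspace(0, 1, 10**5)  # reduced size, an arbitrary choice
a = arr.astype(object)          # the same values, already boxed as objects

# Best of five single runs for each input kind
t_double = min(timeit.repeat(lambda: frompyfunc(arr), number=1, repeat=5))
t_object = min(timeit.repeat(lambda: frompyfunc(a), number=1, repeat=5))
print(f"double input: {t_double * 1e3:.2f} ms")
print(f"object input: {t_object * 1e3:.2f} ms")
```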

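However, the line-profiler timings above do not show any advantage of running the ufunc on objects rather than on doubles: 3.089595 s versus 3.014961 s. My suspicion is that this is due to more cache misses when all objects are created up front, whereas with only 8192 objects created at a time (256 KB) they stay hot in the L2 cache.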
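The 256 KB figure follows from the buffer size and the per-element cost quoted earlier. A small sketch of that arithmetic, assuming the ufunc buffer size reported by np.getbufsize() corresponds to the NPY_BUFSIZE default mentioned above:

```python
import sys
import numpy as np

bufsize = np.getbufsize()                 # ufunc buffer size in elements
float_size = sys.getsizeof(1.0)           # bytes per Python float object
pointer_size = np.dtype(object).itemsize  # bytes per slot in an object array

hot_bytes = bufsize * (float_size + pointer_size)
print(bufsize, float_size, pointer_size, hot_bytes // 1024, "KiB")
```

On a 64-bit CPython with the default buffer size this should come out as 8192 × (24 + 8) bytes = 256 KiB, small enough to fit in a typical L2 cache.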