Question
I was trying to do a pure-Python (no external dependencies) element-wise comparison of two sequences. My first solution was:
list(map(operator.eq, seq1, seq2))
Then I found the starmap function in itertools, which seemed pretty similar to me. But it turned out to be about 37% faster on my computer in the worst case. As this was not obvious to me, I measured the time needed to retrieve one element from each generator (I don't know whether this approach is correct):
from operator import eq
from itertools import starmap
seq1 = [1,2,3]*10000
seq2 = [1,2,3]*10000
seq2[-1] = 5
gen1 = map(eq, seq1, seq2)
gen2 = starmap(eq, zip(seq1, seq2))
%timeit -n1000 -r10 next(gen1)
%timeit -n1000 -r10 next(gen2)
271 ns ± 1.26 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)
208 ns ± 1.72 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)
In retrieving elements, the second solution is 24% more performant. After that, they both produce the same results for list. But somewhere we gain an extra 13% in time:
%timeit list(map(eq, seq1, seq2))
%timeit list(starmap(eq, zip(seq1, seq2)))
5.24 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.34 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I don't know how to dig deeper into profiling such nested code, so my question is: why is the starmap generator so much faster at retrieving elements, and where does the extra 13% in the list call come from?
My first intention was to perform an element-wise comparison rather than all, so the all function was replaced with list. This replacement does not affect the timing ratio.
CPython 3.6.2 on Windows 10 (64-bit)
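Outside IPython, the %timeit comparisons above can be approximated with the stdlib timeit module. This is a minimal sketch of the benchmark (absolute numbers will vary by machine and interpreter version):

```python
import timeit

# Same data as in the question: two long sequences differing in the last item.
setup = """
from operator import eq
from itertools import starmap
seq1 = [1, 2, 3] * 10000
seq2 = [1, 2, 3] * 10000
seq2[-1] = 5
"""

# Best-of-3 timing for consuming each variant into a list.
t_map = min(timeit.repeat("list(map(eq, seq1, seq2))",
                          setup=setup, number=20, repeat=3))
t_starmap = min(timeit.repeat("list(starmap(eq, zip(seq1, seq2)))",
                              setup=setup, number=20, repeat=3))
print(f"map:     {t_map:.4f} s")
print(f"starmap: {t_starmap:.4f} s")
```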
Answer
There are several factors that contribute (in conjunction) to the observed performance difference:
- zip re-uses the returned tuple if it has a reference count of 1 when the next __next__ call is made.
- map builds a new tuple that is passed to the "mapped function" every time a __next__ call is made. Actually, it probably won't create a new tuple from scratch, because Python maintains a free list for unused tuples; but then map has to find an unused tuple of the right size.
- starmap checks if the next item in the iterable is of type tuple and, if so, just passes it on.
- Calling a C function from within C code with PyObject_Call won't create a new tuple to pass to the callee.
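The first point can be observed from pure Python as well. Note this is a CPython implementation detail, not a language guarantee: as long as nothing else holds a reference to the returned tuple when the next __next__ call is made, zip hands back the same object:

```python
it = zip([1, 2, 3], [4, 5, 6])
first = id(next(it))   # the tuple is released again right after id() returns
second = id(next(it))  # refcount was back to 1, so zip reused the same object
print(first == second)  # True on CPython

# Keeping a reference defeats the reuse:
it2 = zip([1, 2, 3], [4, 5, 6])
a = next(it2)           # `a` keeps the first tuple alive
b = next(it2)           # so zip has to produce a fresh tuple
print(a is b)           # False
```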
So starmap with zip will use just one tuple over and over again, passing it to operator.eq and thereby reducing the function-call overhead immensely. map, on the other hand, creates a new tuple (or fills a C array, from CPython 3.6 on) every time operator.eq is called. So the speed difference is really just the tuple-creation overhead.
Instead of linking to the source code, I'll provide some Cython code that can be used to verify this:
In [1]: %load_ext cython
In [2]: %%cython
...:
...: from cpython.ref cimport Py_DECREF
...:
...: cpdef func(zipper):
...: a = next(zipper)
...: print('a', a)
...: Py_DECREF(a)
...: b = next(zipper)
...: print('a', a)
In [3]: func(zip([1, 2], [1, 2]))
a (1, 1)
a (2, 2)
Yes, tuples aren't really immutable here: a simple Py_DECREF was sufficient to "trick" zip into believing no one else holds a reference to the returned tuple!
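The reference count that zip inspects can also be glimpsed from pure Python with sys.getrefcount (again a CPython detail; the exact count is an implementation artifact):

```python
import sys

it = zip([1, 2], [3, 4])
t = next(it)
# At least three references are visible here: the name `t`, zip's internal
# result slot, and the temporary reference getrefcount holds on its argument.
rc = sys.getrefcount(t)
print(rc)

saved = id(t)
del t                   # drop our reference; zip's internal one is the last
u = next(it)
print(id(u) == saved)   # True on CPython: the tuple object was reused
```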
Regarding the tuple "pass-through":
In [4]: %%cython
...:
...: def func_inner(*args):
...: print(id(args))
...:
...: def func(*args):
...: print(id(args))
...: func_inner(*args)
In [5]: func(1, 2)
1404350461320
1404350461320
So the tuple is passed right through (just because these are defined as C functions!). This doesn't happen for pure Python functions:
In [6]: def func_inner(*args):
...: print(id(args))
...:
...: def func(*args):
...: print(id(args))
...: func_inner(*args)
...:
In [7]: func(1, 2)
1404350436488
1404352833800
Note that the pass-through also doesn't happen if the called function isn't a C function, even when it is called from a C function:
In [8]: %%cython
...:
...: def func_inner_c(*args):
...: print(id(args))
...:
...: def func(inner, *args):
...: print(id(args))
...: inner(*args)
...:
In [9]: def func_inner_py(*args):
...: print(id(args))
...:
...:
In [10]: func(func_inner_py, 1, 2)
1404350471944
1404353010184
In [11]: func(func_inner_c, 1, 2)
1404344354824
1404344354824
So there are a lot of "coincidences" leading up to the point that starmap with zip is faster than calling map with multiple arguments, when the called function is also a C function...
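As a corollary, one would expect the gap to shrink when the mapped function is a pure-Python function, because then the argument tuple cannot be passed through and the Python-level call overhead dominates. A sketch to check this (the relative numbers are an assumption and depend on interpreter version and machine):

```python
import timeit

setup = """
from itertools import starmap
seq1 = [1, 2, 3] * 10000
seq2 = [1, 2, 3] * 10000
def py_eq(a, b):       # pure-Python callee: no tuple pass-through possible
    return a == b
"""

t_map = min(timeit.repeat("list(map(py_eq, seq1, seq2))",
                          setup=setup, number=20, repeat=3))
t_starmap = min(timeit.repeat("list(starmap(py_eq, zip(seq1, seq2)))",
                              setup=setup, number=20, repeat=3))
print(f"map:     {t_map:.4f} s")
print(f"starmap: {t_starmap:.4f} s")
```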