Why is this for loop faster than a dict comprehension for creating a dictionary?

Problem Description

Problem: given a dictionary a with these keys and values, return a dictionary with the values as keys and the keys as values. (Challenge: do it in one line.)

I don't come from a software/computer science background but I love to code in Python and can generally understand why things are faster. I am really curious to know why this for loop runs faster than the dictionary comprehension. Any insights?

And the code:

a = {'a':'hi','b':'hey','c':'yo'}

b = {}
for i,j in a.items():
    b[j]=i

%%timeit
932 ns ± 37.2 ns per loop

b = {v: k for k, v in a.items()}

%%timeit
1.08 µs ± 16.4 ns per loop
Solution

You are testing with way too small an input; while a dictionary comprehension doesn't have as large a performance advantage over a for loop as a list comprehension does, for realistic problem sizes it can and does beat for loops, especially when targeting a global name.

Your input consists of just 3 key-value pairs. Testing with 1000 elements instead, we see that the timings are very close:

>>> import timeit
>>> from random import choice, randint; from string import ascii_lowercase as letters
>>> looped = '''\
... b = {}
... for i,j in a.items():
...     b[j]=i
... '''
>>> dictcomp = '''b = {v: k for k, v in a.items()}'''
>>> def rs(): return ''.join([choice(letters) for _ in range(randint(3, 15))])
...
>>> a = {rs(): rs() for _ in range(1000)}
>>> len(a)
1000
>>> count, total = timeit.Timer(looped, 'from __main__ import a').autorange()
>>> (total / count) * 1000000   # microseconds per run
66.62004760000855
>>> count, total = timeit.Timer(dictcomp, 'from __main__ import a').autorange()
>>> (total / count) * 1000000   # microseconds per run
64.5464928005822

The difference is there; the dict comp is faster, but only just, at this scale. With 100 times as many key-value pairs, the difference is a bit bigger:

>>> a = {rs(): rs() for _ in range(100000)}
>>> len(a)
98476
>>> count, total = timeit.Timer(looped, 'from __main__ import a').autorange()
>>> total / count * 1000  # milliseconds, different scale!
15.48140200029593
>>> count, total = timeit.Timer(dictcomp, 'from __main__ import a').autorange()
>>> total / count * 1000  # milliseconds, different scale!
13.674790799996117

which is not that big a difference when you consider that both processed nearly 100k key-value pairs. Still, the for loop is clearly slower.
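
If you want to find the crossover point on your own machine, a small sweep over input sizes makes it visible. This is my own sketch, not part of the original answer; it reuses the rs() helper idea from the benchmark above:

import timeit
from random import choice, randint
from string import ascii_lowercase as letters

def rs():
    # random 3-15 character string, as in the benchmarks above
    return ''.join(choice(letters) for _ in range(randint(3, 15)))

looped = "b = {}\nfor i, j in a.items():\n    b[j] = i\n"
dictcomp = "b = {v: k for k, v in a.items()}"

for n in (3, 10, 100, 1_000, 10_000):
    a = {rs(): rs() for _ in range(n)}
    number = max(100, 100_000 // n)  # keep each trial reasonably quick
    t_loop = min(timeit.repeat(looped, globals={'a': a}, number=number, repeat=3))
    t_comp = min(timeit.repeat(dictcomp, globals={'a': a}, number=number, repeat=3))
    print(f"n={n:>6}: loop {t_loop / number * 1e6:8.2f} µs"
          f"  comp {t_comp / number * 1e6:8.2f} µs")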

So why the speed difference with 3 elements? Because a comprehension (a dictionary, set, or list comprehension, or a generator expression) is implemented under the hood as a new function, and calling that function has a base cost that the plain loop doesn't have to pay.
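
In Python terms, you can think of the comprehension as roughly equivalent to the following sketch (an approximation of the semantics, not CPython's actual machinery; the _dictcomp name is made up for illustration):

a = {'a': 'hi', 'b': 'hey', 'c': 'yo'}

def _dictcomp(iterator):          # the hidden <dictcomp> code object
    result = {}                   # BUILD_MAP
    for k, v in iterator:         # FOR_ITER / UNPACK_SEQUENCE
        result[v] = k             # MAP_ADD
    return result

b = _dictcomp(iter(a.items()))    # MAKE_FUNCTION, then CALL_FUNCTION

The per-item work is the same as the loop's; the extra function creation and call is the fixed overhead discussed below.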

Here's the disassembly of the bytecode for both alternatives. Note the MAKE_FUNCTION and CALL_FUNCTION opcodes in the top-level bytecode for the dict comprehension; there is a separate section for what that function then does, and there are actually very few differences between the two approaches:

>>> import dis
>>> dis.dis(looped)
  1           0 BUILD_MAP                0
              2 STORE_NAME               0 (b)

  2           4 SETUP_LOOP              28 (to 34)
              6 LOAD_NAME                1 (a)
              8 LOAD_METHOD              2 (items)
             10 CALL_METHOD              0
             12 GET_ITER
        >>   14 FOR_ITER                16 (to 32)
             16 UNPACK_SEQUENCE          2
             18 STORE_NAME               3 (i)
             20 STORE_NAME               4 (j)

  3          22 LOAD_NAME                3 (i)
             24 LOAD_NAME                0 (b)
             26 LOAD_NAME                4 (j)
             28 STORE_SUBSCR
             30 JUMP_ABSOLUTE           14
        >>   32 POP_BLOCK
        >>   34 LOAD_CONST               0 (None)
             36 RETURN_VALUE
>>> dis.dis(dictcomp)
  1           0 LOAD_CONST               0 (<code object <dictcomp> at 0x11d6ade40, file "<dis>", line 1>)
              2 LOAD_CONST               1 ('<dictcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_NAME                0 (a)
              8 LOAD_METHOD              1 (items)
             10 CALL_METHOD              0
             12 GET_ITER
             14 CALL_FUNCTION            1
             16 STORE_NAME               2 (b)
             18 LOAD_CONST               2 (None)
             20 RETURN_VALUE

Disassembly of <code object <dictcomp> at 0x11d6ade40, file "<dis>", line 1>:
  1           0 BUILD_MAP                0
              2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                14 (to 20)
              6 UNPACK_SEQUENCE          2
              8 STORE_FAST               1 (k)
             10 STORE_FAST               2 (v)
             12 LOAD_FAST                1 (k)
             14 LOAD_FAST                2 (v)
             16 MAP_ADD                  2
             18 JUMP_ABSOLUTE            4
        >>   20 RETURN_VALUE

The material differences: the looped code uses LOAD_NAME for b on each iteration, plus STORE_SUBSCR to store the key-value pair in the dict it loaded. The dictionary comprehension uses MAP_ADD to achieve the same thing as STORE_SUBSCR, but doesn't have to load that b name each time.
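
Note that the LOAD_NAME cost is specific to module-level code; inside a function, b is a local resolved with the cheaper LOAD_FAST, so you'd expect the loop to claw back some of the gap. A sketch you can try (my own, with illustrative names; the exact numbers will vary):

import timeit

def invert_loop(a):
    b = {}
    for i, j in a.items():
        b[j] = i                 # LOAD_FAST for b, then STORE_SUBSCR
    return b

def invert_comp(a):
    return {v: k for k, v in a.items()}   # MAP_ADD, no name load at all

a = {str(i): str(-i) for i in range(1000)}
print(timeit.timeit(lambda: invert_loop(a), number=10_000))
print(timeit.timeit(lambda: invert_comp(a), number=10_000))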

But with 3 iterations only, the MAKE_FUNCTION / CALL_FUNCTION combo the dict comprehension has to execute is the real drag on the performance:

>>> make_and_call = '(lambda i: None)(None)'
>>> dis.dis(make_and_call)
  1           0 LOAD_CONST               0 (<code object <lambda> at 0x11d6ab270, file "<dis>", line 1>)
              2 LOAD_CONST               1 ('<lambda>')
              4 MAKE_FUNCTION            0
              6 LOAD_CONST               2 (None)
              8 CALL_FUNCTION            1
             10 RETURN_VALUE

Disassembly of <code object <lambda> at 0x11d6ab270, file "<dis>", line 1>:
  1           0 LOAD_CONST               0 (None)
              2 RETURN_VALUE
>>> count, total = timeit.Timer(make_and_call).autorange()
>>> total / count * 1000000
0.12945385499915574

More than 0.1 μs to create a function object with one argument, and call it (with an extra LOAD_CONST for the None value we pass in)! And that's just about the difference between the looped and comprehension timings for 3 key-value pairs.

You can liken this to being surprised that a man with a shovel can dig a small hole faster than a backhoe. The backhoe can certainly dig fast, but the man with the shovel gets started sooner if you first have to start up the backhoe and move it into position!

Beyond a few key-value pairs (a bigger hole to dig), the function creation and call cost fades away into nothingness. At that point the dict comprehension and the explicit loop basically do the same thing:

  • take the next key-value pair and put those on the stack
  • call the dict.__setitem__ hook via a bytecode operation with the top two items on the stack (either STORE_SUBSCR or MAP_ADD); this doesn't count as a 'function call', because it's all handled internally in the interpreter loop (a quick check follows below)
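
You can verify those per-item opcodes yourself (a quick check of my own, not from the answer; dis.dis recursively disassembles the nested <dictcomp> code object on Python 3.7+):

import dis

dis.dis("b[j] = i")                       # per-item store: STORE_SUBSCR
dis.dis("{v: k for k, v in a.items()}")   # nested <dictcomp> uses MAP_ADD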

This is different from a list comprehension, where the plain-loop version would have to use list.append(), involving an attribute lookup and a function call on every loop iteration. The list comprehension's speed advantage comes from that difference; see Python list comprehension expensive.
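
For comparison, here is a minimal sketch of that list case (my own illustration, not from the answer): the plain loop pays an attribute lookup plus a call per iteration, a common workaround hoists the bound method out of the loop, and the comprehension avoids both via the LIST_APPEND opcode:

items = range(1000)

result = []
for x in items:
    result.append(x * 2)      # attribute lookup + function call each time

result = []
append = result.append       # hoist the lookup; still a call per item
for x in items:
    append(x * 2)

result = [x * 2 for x in items]   # LIST_APPEND opcode, no Python-level call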

What a dict comprehension does add is that the target dictionary name only needs to be looked up once, when binding b to the final dictionary object. If the target dictionary is a global instead of a local variable, the comprehension wins hands down:

>>> a = {rs(): rs() for _ in range(1000)}
>>> len(a)
1000
>>> namespace = {}
>>> count, total = timeit.Timer(looped, 'from __main__ import a; global b', globals=namespace).autorange()
>>> (total / count) * 1000000
76.72348440100905
>>> count, total = timeit.Timer(dictcomp, 'from __main__ import a; global b', globals=namespace).autorange()
>>> (total / count) * 1000000
64.72114819916897
>>> len(namespace['b'])
1000

So just use a dict comprehension. The difference with fewer than ~30 elements to process is too small to care about, and the moment you are targeting a global or have more items, the dict comprehension wins out anyway.
