本文介绍了Python timeit 的惊人结果:Counter() vs defaultdict() vs dict()的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我用 timeit 得到了非常令人惊讶的结果,有人能告诉我我做错了什么吗?我使用的是 Python 2.7.

这是文件speedtest_init.py的内容:

随机导入to_count = [random.randint(0, 100) for r in range(60)]

这些是speedtest.py的内容:

__author__ = 'BlueTrin'导入时间def test_init1():打印(timeit.timeit('导入speedtest_init'))def test_counter1():s = """d = defaultdict(int);对于我在 speedtest_init.to_count 中:d[i] += 1"""打印(timeit.timeit(s,'从集合导入defaultdict;导入speedtest_init;'))def test_counter2():print(timeit.timeit('d = Counter(speedtest_init.to_count);', 'from collections import Counter; import speedtest_init;'))如果 __name__ == "__main__":test_init1()test_counter1()test_counter2()

控制台输出为:

C:Python27python.exe C:/Dev/codility/chlorum2014/speedtest.py2.7150196293165.709044450391.2953839048进程完成,退出代码 0

我认为默认情况下 timeit() 运行代码的 1000000 倍,所以我需要将时间除以 1000000,但令人惊讶的是 Counter 比 defaultdict() 慢.

这是预期的吗?

同样使用 dict 比使用 defaultdict(int) 更快:

def test_counter3():s = """d = {};对于我在 speedtest_init.to_count 中:如果我不在 d:d[i] = 1别的:d[i] += 1"""打印(timeit.timeit(stmt=s, setup='from collections import defaultdict; import speedtest_init;')

最后一个版本比 defaultdict(int) 更快,这意味着除非您更关心可读性,否则您应该使用 dict() 而不是 defaultdict().

解决方案

是的,这是意料之中的;Counter() constructor 使用 Counter.update() 它使用 self.get() 加载初始值而不是依赖 __missing__.

此外,defaultdict __missing__ 工厂完全在 C 代码中处理,尤其是在使用另一种类型(如自身实现的 int())时在 C 中.Counter 源是纯 Python,因此 Counter.__missing__ 方法需要一个 Python 框架来执行.

因为 dict.get() 仍然在 C 中处理,构造方法是 Counter() 的更快方法,前提是您使用相同的技巧 Counter.update() 使用和别名 self.get 查找作为本地优先:

>>>导入时间>>>随机导入>>>导入系统>>>sys.version_infosys.version_info(major=2,minor=7,micro=9,releaselevel='final',serial=0)>>>to_count = [random.randint(0, 100) for r in range(60)]>>>timeit.timeit('for i in to_count: c[i] += 1',...'从集合导入计数器;从 __main__ 导入到_count;c = 计数器()',...数=10000)0.2510359287261963>>>timeit.timeit('for i in to_count: c[i] = c_get(i, 0) + 1',...'从集合导入计数器;从 __main__ 导入到_count;c = 计数器();c_get = c.get',...数=10000)0.20978617668151855

defaultdictCounter 都是为其功能而非性能而构建的有用类;不依赖 __missing__ 钩子可以更快:

>>>timeit.timeit('for i in to_count: d[i] = d_get(i, 0) + 1',... '从 __main__ 导入到_count;d = {};d_get = d.get',...数=10000)0.11437392234802246

这是一个常规字典,使用别名 dict.get() 方法以获得最大速度.但是,您还必须重新实现 CounterCounter.most_common() 方法的包行为.defaultdict 用例远不止计数.

在 Python 3.2 中,通过添加处理这种情况的 C 库,更新 Counter() 获得了速度提升;请参阅问题 10667.在 Python 3.4 上测试,Counter() 构造函数现在击败了别名 dict.get 案例:

>>>timeit.timeit('计数器(to_count)',...'从集合导入计数器;从 __main__ 导入到_count',...数=100000)0.8332311600097455>>>timeit.timeit('for i in to_count: d[i] = d_get(i, 0) + 1',... '从 __main__ 导入到_count;d = {};d_get = d.get',...数=100000)0.961191965994658>>>导入系统>>>sys.version_infosys.version_info(major=3,minor=4,micro=2,releaselevel='final',serial=0)

(注意:为了获得有意义的计时结果,迭代次数从 10k 增加到 100k;因此,如果您将这些与上面的 dict.get() 情况进行比较,您需要采取计时有十次,在 1.144 秒).

I obtained very surprising results with timeit, can someone tell me if I am doing something wrong ? I am using Python 2.7.

This is the contents of file speedtest_init.py:

import random

to_count = [random.randint(0, 100) for r in range(60)]

These are the contents of speedtest.py:

__author__ = 'BlueTrin'

import timeit

def test_init1():
    print(timeit.timeit('import speedtest_init'))

def test_counter1():
    s = """
    d = defaultdict(int);
    for i in speedtest_init.to_count:
        d[i] += 1
    """
    print(timeit.timeit(s, 'from collections import defaultdict; import speedtest_init;'))

def test_counter2():
    print(timeit.timeit('d = Counter(speedtest_init.to_count);', 'from collections import Counter; import speedtest_init;'))


if __name__ == "__main__":
    test_init1()
    test_counter1()
    test_counter2()

The console output is:

C:Python27python.exe C:/Dev/codility/chlorum2014/speedtest.py
2.71501962931
65.7090444503
91.2953839048

Process finished with exit code 0

I think by default timeit() runs 1000000 times the code, so I need to divide the times by 1000000, but what is surprising is that the Counter is slower than the defaultdict().

Is that expected ?

EDIT:

Also using a dict is faster than a defaultdict(int):

def test_counter3():
    s = """
    d = {};
    for i in speedtest_init.to_count:
        if i not in d:
            d[i] = 1
        else:
            d[i] += 1
    """
    print(timeit.timeit(stmt=s, setup='from collections import defaultdict; import speedtest_init;')

this last version is faster than the defaultdict(int) meaning that unless you care more about readability you should use the dict() rather than the defaultdict().

解决方案

Yes, this is expected; the Counter() constructor uses Counter.update() which uses self.get() to load initial values rather than rely on __missing__.

Moreover, the defaultdict __missing__ factory is handled entirely in C code, especially when using another type like int() that is itself implemented in C. The Counter source is pure Python and as such the Counter.__missing__ method requires a Python frame to execute.

Because dict.get() is still handled in C, the constructor approach is the faster approach for a Counter(), provided you use the same trick Counter.update() uses and alias the self.get lookup as a local first:

>>> import timeit
>>> import random
>>> import sys
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=9, releaselevel='final', serial=0)
>>> to_count = [random.randint(0, 100) for r in range(60)]
>>> timeit.timeit('for i in to_count: c[i] += 1',
...               'from collections import Counter; from __main__ import to_count; c = Counter()',
...               number=10000)
0.2510359287261963
>>> timeit.timeit('for i in to_count: c[i] = c_get(i, 0) + 1',
...               'from collections import Counter; from __main__ import to_count; c = Counter(); c_get = c.get',
...               number=10000)
0.20978617668151855

Both defaultdict and Counter are helpful classes built for their functionality, not their performance; not relying on the __missing__ hook can be faster still:

>>> timeit.timeit('for i in to_count: d[i] = d_get(i, 0) + 1',
...               'from __main__ import to_count; d = {}; d_get = d.get',
...               number=10000)
0.11437392234802246

That's a regular dictionary using an aliased dict.get() method for maximum speed. But then you'll also have to re-implement the bag behaviour of Counter, or the Counter.most_common() method. The defaultdict use cases go way beyond counting.

In Python 3.2, updating a Counter() got a speed boost by adding a C library that handles this case; see issue 10667. Testing on Python 3.4, the Counter() constructor now beats the aliased dict.get case:

>>> timeit.timeit('Counter(to_count)',
...               'from collections import Counter; from __main__ import to_count',
...               number=100000)
0.8332311600097455
>>> timeit.timeit('for i in to_count: d[i] = d_get(i, 0) + 1',
...               'from __main__ import to_count; d = {}; d_get = d.get',
...               number=100000)
0.961191965994658
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=4, micro=2, releaselevel='final', serial=0)

(Note: to get a meaningful timing result the number of iterations was increased from 10k to 100k; so if you compare these against the dict.get() case above you need to take the timing there times ten, at 1.144 seconds).

这篇关于Python timeit 的惊人结果:Counter() vs defaultdict() vs dict()的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-01 15:41