Problem Description
Basically, I'm looking for something that offers a parallel map using Python 3 coroutines as the backend instead of threads or processes. I believe there should be less overhead when performing highly parallel I/O work.
Surely something similar already exists, be it in the standard library or some widely used package?
Answer
DISCLAIMER: PEP 492 defines only the syntax and usage of coroutines. They require an event loop to run, which is most likely asyncio's event loop.
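As a minimal illustration of that point (the coroutine name here is my own, not from the PEP): calling an async def function only creates a coroutine object, and nothing happens until an event loop drives it:

import asyncio

async def greet():
    # PEP 492 syntax: 'async def' defines a coroutine function.
    await asyncio.sleep(0)
    return "hello"

# greet() alone just builds a coroutine object; the loop must run it.
loop = asyncio.get_event_loop()
print(loop.run_until_complete(greet()))  # prints: hello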
I don't know of any implementation of map based on coroutines. However, it's trivial to implement basic map functionality using asyncio.gather():
import asyncio

def async_map(coroutine_func, iterable):
    # Create one coroutine per item, merge them, and run to completion.
    loop = asyncio.get_event_loop()
    future = asyncio.gather(*(coroutine_func(param) for param in iterable))
    return loop.run_until_complete(future)
This implementation is really simple. It creates a coroutine for each item in the iterable, joins them into a single coroutine with gather(), and executes the joined coroutine on the event loop.
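For example, with a hypothetical coroutine fetch_square (the name and the one-second delay are my illustration, not part of the answer), all the sleeps overlap, so the whole call finishes in roughly one second:

import asyncio

async def fetch_square(val):
    await asyncio.sleep(1)  # stand-in for real non-blocking I/O
    return val * val

print(async_map(fetch_square, range(5)))  # [0, 1, 4, 9, 16]

Note that gather() returns results in the order of the inputs, just like the built-in map().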
The provided implementation covers part of the cases. However, it has a problem: with a long iterable you would probably want to limit the number of coroutines running in parallel. I can't come up with a simple implementation that is efficient and preserves order at the same time, so I will leave it as an exercise for the reader (one possible sketch follows below).
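One common approach, offered here as a sketch of my own rather than a definitive implementation: guard each call with an asyncio.Semaphore, which caps how many coroutines do work at once, while gather() still preserves input order. It is not fully efficient for very long iterables, since all coroutine objects are still created up front:

import asyncio

def async_map_bounded(coroutine_func, iterable, limit=10):
    # The semaphore lets at most `limit` wrapped coroutines
    # past the `async with` at any moment.
    semaphore = asyncio.Semaphore(limit)

    async def bounded(param):
        async with semaphore:
            return await coroutine_func(param)

    loop = asyncio.get_event_loop()
    future = asyncio.gather(*(bounded(param) for param in iterable))
    return loop.run_until_complete(future)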
You state that coroutines should involve less overhead than threads or processes for highly parallel I/O work. That claim requires proof, so here is a comparison of a multiprocessing implementation, the gevent implementation by a p, and my implementation based on coroutines. All tests were performed on Python 3.5.
Implementation using multiprocessing:
from multiprocessing import Pool
import time

def async_map(f, iterable):
    # Run one process per item to measure overhead only.
    with Pool(len(iterable)) as p:
        return p.map(f, iterable)

def func(val):
    time.sleep(1)  # blocking sleep: occupies the whole process
    return val * val
Implementation using gevent:
import gevent
from gevent.pool import Group

def async_map(f, iterable):
    # Spawn one greenlet per item; map() returns results in order.
    group = Group()
    return group.map(f, iterable)

def func(val):
    gevent.sleep(1)  # cooperative sleep: yields control to other greenlets
    return val * val
Implementation using asyncio:
import asyncio

def async_map(f, iterable):
    loop = asyncio.get_event_loop()
    future = asyncio.gather(*(f(param) for param in iterable))
    return loop.run_until_complete(future)

async def func(val):
    await asyncio.sleep(1)  # non-blocking sleep: yields to other coroutines
    return val * val
Testing was done, as usual, with timeit:
$ python3 -m timeit -s 'from perf.map_mp import async_map, func' -n 1 'async_map(func, list(range(10)))'
Results:

Iterable of 10 items:
- multiprocessing - 1.05 sec
- gevent - 1 sec
- asyncio - 1 sec

Iterable of 100 items:
- multiprocessing - 1.16 sec
- gevent - 1.01 sec
- asyncio - 1.01 sec

Iterable of 500 items:
- multiprocessing - 2.31 sec
- gevent - 1.02 sec
- asyncio - 1.03 sec

Iterable of 5000 items:
- multiprocessing - failed (spawning 5k processes is not such a good idea!)
- gevent - 1.12 sec
- asyncio - 1.22 sec

Iterable of 50000 items:
- gevent - 2.2 sec
- asyncio - 3.25 sec
Conclusions

Concurrency based on an event loop works faster when the program does mostly I/O, not computation. Keep in mind that the difference will be smaller when there is less I/O and more computation involved.
The overhead introduced by spawning processes is significantly bigger than the overhead introduced by event-loop-based concurrency. That means your assumption is correct.
Comparing asyncio and gevent, we can say that asyncio has 33-45% bigger overhead. That means creating greenlets is cheaper than creating coroutines.
As a final conclusion: gevent has better performance, but asyncio is part of the standard library. The difference in performance (in absolute numbers) isn't very significant. gevent is quite a mature library, while asyncio is relatively new, but it is advancing quickly.