Q : "Is Pool.map() really passing the content of those big sub-lists around between processes, causing additional copies?"

Yes, it must do so, by design, plus it does that by passing all of that data "through" another, "expensive", SER/DES processing, so as to get it delivered "there".

Because the remote ( parameter-receiving ) process is another, fully autonomous process, with its own, separate and protected address-space, we cannot just pass an address-reference "into" it, and we wanted it to be a fully independent, autonomously working python process ( due to a will to use this trick so as to escape from the GIL-lock dancing ), didn't we? Sure we did - this is a central step of our escape from the GIL-Wars ( for a better understanding of the GIL-lock pros and cons, one may like this and this ( Pg.15+ on CPU-bound processing ) ).

To put the relative costs into perspective :

```
            0.1 ns - NOP
            0.3 ns - XOR, ADD, SUB
            0.5 ns - CPU L1 dCACHE reference          ( 1st introduced in late 80-ies )
            0.9 ns - JMP SHORT
            1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
                     -- will stay, throughout any foreseeable future :o)
            1   ns - MUL ( i**2 = MUL i, i )
                     doing this 1,000 x is 1 [us]; 1,000,000 x is 1 [ms]; 1,000,000,000 x is 1 [s]
          3~4   ns - CPU L2  CACHE reference          ( 2020/Q1 )
            5   ns - CPU L1 iCACHE Branch mispredict
            7   ns - CPU L2  CACHE reference
           10   ns - DIV
           19   ns - CPU L3  CACHE reference          ( 2020/Q1 considered slow on 28c Skylake )
           71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
          100   ns - MUTEX lock/unlock
          100   ns - own DDR MEMORY reference
          135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
          202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
          325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
       10,000   ns - Compress 1K bytes with a Zippy PROCESS
       20,000   ns - Send     2K bytes over 1 Gbps NETWORK
      250,000   ns - Read   1 MB sequentially from MEMORY
      500,000   ns - Round trip within a same DataCenter
    2,500,000   ns - Read  10 MB sequentially from MEMORY
                     ( ~ an empty python process to copy on spawn )
                     x ( 1 + nProcesses ) on spawned process instantiation(s),
                     yet an empty python interpreter is indeed not a real-world,
                     production-grade use-case, is it?
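If you want to see how big the Pickle.dump()-row of those costs is on your own hardware, a minimal timing sketch may help ( the payload size and the number of repetitions here are arbitrary, illustrative choices, not from the original benchmark ) :

```python
import pickle
import timeit

# a roughly 10 MB parameter-payload, similar in spirit to one "big sub-list"
payload = [ float( i ) for i in range( 10**6 ) ]

def ser_des():
    # the SER/DES round-trip that every Pool.map() parameter
    # ( and every returned result ) implicitly pays, per call
    return pickle.loads( pickle.dumps( payload ) )

n = 10
t = timeit.timeit( ser_des, number = n )
print( f"SER/DES round-trip took ~ {t / n * 1E3:.1f} [ms] per call" )
```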
   10,000,000   ns - DISK seek
   10,000,000   ns - Read   1 MB sequentially from NETWORK
   25,000,000   ns - Read 100 MB sequentially from MEMORY
                     ( ~ a somewhat light python process to copy on spawn )
                     x ( 1 + nProcesses ) on spawned process instantiation(s)
   30,000,000   ns - Read   1 MB sequentially from a DISK
   36,000,000   ns - Pickle.dump() SER a 10 MB object for IPC-transfer
                     and the remote DES in the spawned process
                     x ( 2 ) for a single 10 MB parameter-payload SER/DES,
                     + add the IPC-transport costs thereof, or NETWORK-grade
                     transport costs, if going into a [distributed-computing]
                     model Cluster ecosystem
  150,000,000   ns - Send a NETWORK packet CA -> Netherlands
  |   |   |   |
  |   |   | ns|
  |   | us|
  | ms|
```

A ) UNDERSTAND THE WAYS TO AVOID OR AT LEAST REDUCE EXPENSES :

Understand all the types of the costs you have to pay and will pay :

- spend as low process-instantiation costs as possible ( rather expensive ), best as a one-time cost only

- spend as small an amount of parameter-passing costs as you must ( yes, best avoid repetitively passing those "large things" as parameters - one way to do so is sketched right after section B ) below )

B ) UNDERSTAND THE WAYS TO INCREASE THE EFFICIENCY :

Understand all the efficiency-increasing tricks, even at a cost of code complexity ( a few SLOC-s are easy to show in school-books, yet they sacrifice both the efficiency and the performance - despite these two being your main enemies in a fight for sustainable performance throughout the scaling ( of either the problem size or the iteration depths, or when growing both of them at the same time ) ).
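One standard way to follow the advice in A ) - paying the transfer of a "large thing" once per worker instead of once per task - is the initializer / initargs pattern of multiprocessing.Pool. A hedged sketch; the names setup_worker, work_on_slice and the toy workload are illustrative, not from the original code :

```python
from multiprocessing import Pool

_BIG = None                                 # each worker's private, read-only copy

def setup_worker( big_data ):
    # runs once per spawned worker process:
    # the "large thing" gets SER/DES-ed once per worker, not once per task
    global _BIG
    _BIG = big_data

def work_on_slice( bounds ):
    start, stop = bounds                    # only a tiny 2-tuple travels per task
    return sum( len( row ) for row in _BIG[start:stop] )

if __name__ == '__main__':
    big_list = [ list( range( 100 ) ) for _ in range( 10**4 ) ]
    with Pool( 4, initializer = setup_worker,
                  initargs    = ( big_list, ) ) as p:
        sizes = p.map( work_on_slice,
                       [ ( i, i + 10**3 ) for i in range( 0, 10**4, 10**3 ) ] )
    print( sum( sizes ) )                   # -> 1,000,000
```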
Some categories of the real-world costs from A ) have dramatically changed the limits of the theoretically achievable speedups to be expected from going into some form of [PARALLEL] process orchestration ( here, making some parts of the code-execution get executed in spawned sub-processes ). The initial view of this was first formulated by Dr. Gene Amdahl as early as 60+ years ago; to it, two principal extensions have recently been added - the process instantiation(s)-related setup + termination add-on costs ( extremely important in py2 always & in py3.5+ for MacOS and Windows ) and an atomicity-of-work, both discussed below.

```
S   = speedup which can be achieved with N processors
s   = a proportion of the calculation, which is [SERIAL]
1-s = a parallelisable portion, that may run [PAR]
N   = a number of processors ( CPU-cores ) actively participating in the [PAR] processing
```

The overhead-strict re-formulation of the Amdahl's Law speedup S :

```
                        1                      where s, ( 1 - s ), N were defined above
S =  ________________________________ ;        pSO := a [PAR]-Setup-Overhead     add-on cost/latency
                   ( 1 - s )                   pTO := a [PAR]-Terminate-Overhead add-on cost/latency
     s  +  pSO  +  _________  +  pTO
                       N
```

The overhead-strict and resources-aware re-formulation :

```
                             1                          where s, ( 1 - s ), N, pSO, pTO
S =  _______________________________________________ ;  were defined above
                      |  ( 1 - s )            |
     s  +  pSO + max  |  _________ , atomicP  |  + pTO  atomicP := a unit of work, further indivisible,
                      |      N                |                    a duration of an atomic-process-block
```
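A quick, illustrative evaluation of these two formulations ( the numeric values of s, pSO, pTO below are assumed, just to demonstrate how fast the overheads eat the theoretical speedup ) :

```python
# the overhead-strict, resources-aware Amdahl's Law, as re-formulated above
def speedup( s, N, pSO = 0.0, pTO = 0.0, atomicP = 0.0 ):
    return 1.0 / ( s + pSO + max( ( 1.0 - s ) / N, atomicP ) + pTO )

# assumed, illustrative numbers: a 5 % [SERIAL] part, overheads of ~ 1 % each
for N in ( 2, 4, 8, 16, 32 ):
    classic = speedup( s = 0.05, N = N )                          # zero-overhead law
    strict  = speedup( s = 0.05, N = N, pSO = 0.01, pTO = 0.01 )  # overhead-strict law
    print( f"N = {N:2d}   classic S ~ {classic:5.2f}   overhead-strict S ~ {strict:5.2f}" )
```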
Prototype on the target CPU/RAM device, with your python, scaled >> 1E+6

Any simplified mock-up example will somehow skew your expectations about how the actual workloads will perform in-vivo. Underestimated RAM-allocations, not seen at small scales, may later surprise you at scale, sometimes even throwing the operating system into sluggish states, swapping and thrashing. Some smarter tools ( numba.jit() ) may even analyse the code and shortcut some passages of code, that will never be visited or that do not produce any result, so be warned that simplified examples may lead to surprising observations.

```python
from multiprocessing import Pool
import numpy as np
import os

SCALE = int( 1E9 )
STEP  = int( 1E1 )
aLIST = np.random.random( ( 10**3, 10**4 ) ).tolist()

#######################################################################################
# func() does some SCALE'd amount of work, yet
#        passes almost zero bytes as parameters
#        allocates nothing, but the iterator
#        returns one byte,
#        invariant to any expensive inputs
def func( x ):
    for i in range( SCALE ):
        i**2
    return 1
```

A few hints on making the strategy of scaling less overhead-costs expensive :

```python
######################################################################################
# more_work_en_block() wraps some SCALE'd amount of work, sub-list specified
def more_work_en_block( en_block = [ None, ] ):
    return [ func( nth_item ) for nth_item in en_block ]
```

If you indeed must pass a big list, better pass a larger block, with remote iteration over its parts ( instead of paying the transfer costs for each and every item passed, many more times than when using sub_blocks - the parameters get SER/DES-processed ( ~ the costs of pickle.dumps() + pickle.loads() ) [per-each-call], again at an add-on cost, which decreases the resulting efficiency and worsens the overheads part of the extended, overhead-strict Amdahl's Law ) :

```python
#####################################################################################
# some_work_en_block() wraps some SCALE'd amount of work, tuple-specified
def some_work_en_block( sub_block = ( [ None, ], 0, 1 ) ):
    return more_work_en_block( en_block = sub_block[0][sub_block[1]:sub_block[2]] )
```

Tune the number of process instantiations appropriately :

```python
aMaxNumOfProcessesThatMakesSenseToSPAWN = len( os.sched_getaffinity( 0 ) ) # never more

with Pool( aMaxNumOfProcessesThatMakesSenseToSPAWN ) as p:
     # note: some_work_en_block(), whose signature the tuple-payload matches;
     #       list() consumes the lazy iterator, so the work actually gets dispatched
     results = list( p.imap_unordered( some_work_en_block,
                                       [ ( aLIST, start, start + STEP )
                                           for start in range( 0, len( aLIST ), STEP ) ]
                                       ) )
```

Last but not least, expect immense performance boosts from a smart use of numpy vectorised code - best without any repetitive passing of static, pre-copied BLOBs ( copied during the process instantiation(s), thus paid as a reasonably scaled, here un-avoidable, cost thereof ), used inside the code as read-only data, in a vectorised ( CPU-very-efficient ) fashion, without moving the same data via parameter-passing again and again. Some examples of how one can achieve a ~ +500 x speedup one may read here or here, about a ~ +400 x speedup, or about a case of a just ~ +100 x speedup, with some examples of problem-isolation testing scenarios.
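A toy illustration of that vectorisation hint ( nothing more than a demonstration of the principle - the ~ +100 x ... ~ +500 x figures above refer to the linked, more involved cases ) :

```python
import numpy as np
import timeit

SCALE = int( 1E6 )          # deliberately reduced here, so int64 sums do not overflow

def loop_work():
    # the interpreted, per-item python loop, as in func() above
    s = 0
    for i in range( SCALE ):
        s += i * i
    return s

def vect_work():
    # the same arithmetic, executed once inside numpy's compiled loops
    a = np.arange( SCALE, dtype = np.int64 )
    return int( a.dot( a ) )

assert loop_work() == vect_work()
print( "loop:", timeit.timeit( loop_work, number = 10 ), "[s] / 10 runs" )
print( "vect:", timeit.timeit( vect_work, number = 10 ), "[s] / 10 runs" )
```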
Anyway, the closer the mock-up code is to your actual workloads, the more sense the benchmarks will make ( at scale & in production ).