Question
DoFn.Setup
Annotation for the method to use to prepare an instance for processing bundles of elements.
Uses the word "bundle", takes zero arguments.
DoFn.StartBundle
Annotation for the method to use to prepare an instance for processing a batch of elements.
Uses the word "batch", takes zero or one arguments (StartBundleContext, a way to access PipelineOptions).
I need to initialize a library within the DoFn instance, then use that library for every element in the "batch" or "bundle". I wouldn't normally split hairs over these two words, but in a pipeline, might there be some difference?
Recommended answer
The lifecycle of a DoFn is as follows:
- Setup
- Repeatedly process bundles:
  - StartBundle
  - Repeated ProcessElement
  - FinishBundle
- Teardown
That is, one instance of a DoFn can process many (zero or more) bundles, and within one bundle it processes many (zero or more) elements.
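
The call order described above can be modeled in plain Java without any Beam dependency. This is only a sketch: RecordingFn and LifecycleModel are hypothetical names, not Beam classes, and a real runner decides for itself how many instances to create and how elements are split into bundles. In Beam's Java SDK the corresponding methods would carry the @Setup, @StartBundle, @ProcessElement, @FinishBundle and @Teardown annotations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a DoFn: it only records which lifecycle method ran.
class RecordingFn {
    final List<String> calls = new ArrayList<>();

    void setup() { calls.add("Setup"); }                 // @Setup in Beam
    void startBundle() { calls.add("StartBundle"); }     // @StartBundle
    void processElement(String element) {                // @ProcessElement
        calls.add("ProcessElement(" + element + ")");
    }
    void finishBundle() { calls.add("FinishBundle"); }   // @FinishBundle
    void teardown() { calls.add("Teardown"); }           // @Teardown
}

// Hypothetical stand-in for a runner driving a single DoFn instance.
class LifecycleModel {
    static List<String> run(List<List<String>> bundles) {
        RecordingFn fn = new RecordingFn();
        fn.setup();                          // once per instance
        for (List<String> bundle : bundles) {
            fn.startBundle();                // once per bundle
            for (String element : bundle) {
                fn.processElement(element);  // once per element
            }
            fn.finishBundle();               // once per bundle
        }
        fn.teardown();                       // once per instance
        return fn.calls;
    }

    public static void main(String[] args) {
        // Three bundles, one of them empty (a bundle may hold zero elements).
        List<String> calls = run(List.of(List.of("a", "b"), List.of(), List.of("c")));
        calls.forEach(System.out::println);
    }
}
```

Note that the empty bundle still gets its StartBundle and FinishBundle calls, while Setup and Teardown each appear exactly once for the instance.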
Both Setup/Teardown and StartBundle/FinishBundle are optional: it is possible to implement any DoFn without them and do all the work in ProcessElement, but it will be inefficient. Both pairs of methods allow optimizations:
- Often one wants to batch work across elements, e.g. instead of doing an RPC per element, do one RPC per batch of N elements. StartBundle/FinishBundle tell you the allowed boundaries of batching: basically, you are not allowed to batch across FinishBundle. FinishBundle must force a flush of your batch (and StartBundle must initialize / reset the batch). This is the only common use of these methods that I'm aware of, but if you're interested in a more general or rigorous explanation: a bundle is a unit of fault tolerance, and the runner assumes that by the time FinishBundle returns, you have completely performed all the work (outputting elements or performing side effects) associated with all elements seen in this bundle; work must not "leak" between bundles.
- Often one wants to manage long-lived resources, e.g. network connections. You could do this in StartBundle/FinishBundle, but, unlike pending side effects or output, it is fine for such resources to persist between bundles. That's what Setup and Teardown are for.
- Also often one wants to perform costly initialization of a DoFn, e.g. parsing a config file. This is also best done in Setup.
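
As a rough illustration of the points above, here is a dependency-free Java sketch of a function that opens its client once in setup, resets its batch in startBundle, and flushes in finishBundle so that no work leaks across bundles. Everything here is an assumption made for the example: FakeRpcClient, BatchingFn, BatchingDemo and MAX_BATCH_SIZE are hypothetical names, and the driver loop only imitates what a runner does. In a real Beam DoFn the same methods would be annotated with @Setup, @StartBundle, @ProcessElement, @FinishBundle and @Teardown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client standing in for the library or connection to initialize.
class FakeRpcClient {
    final List<List<String>> sentBatches = new ArrayList<>();
    boolean open = true;

    void send(List<String> batch) { sentBatches.add(new ArrayList<>(batch)); }
    void close() { open = false; }
}

// Hypothetical DoFn-like class; not a Beam type.
class BatchingFn {
    static final int MAX_BATCH_SIZE = 2;  // assumed value, chosen for the demo

    FakeRpcClient client;        // long-lived resource, survives across bundles
    private List<String> batch;  // pending work, must not survive a bundle

    void setup() { client = new FakeRpcClient(); }     // costly init, once per instance
    void startBundle() { batch = new ArrayList<>(); }  // reset the batch

    void processElement(String element) {
        batch.add(element);
        if (batch.size() >= MAX_BATCH_SIZE) {
            flush();
        }
    }

    void finishBundle() { flush(); }        // forced flush at the bundle boundary
    void teardown() { client.close(); }     // release the resource

    private void flush() {
        if (!batch.isEmpty()) {
            client.send(batch);  // one call for the whole batch
            batch.clear();
        }
    }
}

class BatchingDemo {
    // Imitates a runner feeding bundles to a single instance.
    static FakeRpcClient run(List<List<String>> bundles) {
        BatchingFn fn = new BatchingFn();
        fn.setup();
        for (List<String> bundle : bundles) {
            fn.startBundle();
            for (String element : bundle) {
                fn.processElement(element);
            }
            fn.finishBundle();
        }
        fn.teardown();
        return fn.client;
    }

    public static void main(String[] args) {
        FakeRpcClient client = run(List.of(List.of("a", "b", "c"), List.of("d")));
        System.out.println(client.sentBatches);
    }
}
```

The leftover element "c" is sent when the first bundle finishes rather than being carried into the second bundle. If the flush lived only in teardown, any elements still pending when a bundle was committed could be lost.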
More concisely:
- Manage resources and costly initialization in Setup/Teardown.
- Manage batching of work in StartBundle/FinishBundle.
(Managing resources in the bundle methods is inefficient; managing batching in Setup/Teardown is plain incorrect and will lead to data loss.)
The DoFn documentation was recently updated to make this clearer.