Problem description
DoFn.Setup
Annotation for the method to use to prepare an instance for processing bundles of elements.
Uses the word "bundle", takes zero arguments.
DoFn.StartBundle
Annotation for the method to use to prepare an instance for processing a batch of elements.
Uses the word "batch", takes zero or one arguments (StartBundleContext, a way to access PipelineOptions).
I need to initialize a library within the DoFn instance, then use that library for every element in the "batch" or "bundle". I wouldn't normally split hairs over these two words, but in a pipeline, might there be some difference?
Recommended answer
The lifecycle of a DoFn is as follows:
- Setup
- Repeatedly process a bundle:
  - StartBundle
  - Repeatedly ProcessElement
  - FinishBundle
- Teardown
I.e. one instance of a DoFn can process many (zero or more) bundles, and within one bundle, it processes many (zero or more) elements.
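The call order described above can be traced with a small plain-Java model. This is only a sketch: it does not use the Beam SDK, and the `LifecycleTrace` class and its `run` method are hypothetical stand-ins for a DoFn and a runner. In real Beam code the methods would carry the `@Setup`, `@StartBundle`, `@ProcessElement`, `@FinishBundle` and `@Teardown` annotations, and the runner would decide how elements are split into bundles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model of the DoFn call order; NOT Beam SDK code.
// Each method records its own name so the order can be inspected.
class LifecycleTrace {
    final List<String> calls = new ArrayList<>();

    void setup() { calls.add("Setup"); }                    // @Setup in Beam
    void startBundle() { calls.add("StartBundle"); }        // @StartBundle
    void processElement(String element) {                   // @ProcessElement
        calls.add("ProcessElement(" + element + ")");
    }
    void finishBundle() { calls.add("FinishBundle"); }      // @FinishBundle
    void teardown() { calls.add("Teardown"); }              // @Teardown

    // Hypothetical stand-in for a runner: one instance, zero or more bundles,
    // each bundle holding zero or more elements.
    static List<String> run(List<List<String>> bundles) {
        LifecycleTrace fn = new LifecycleTrace();
        fn.setup();
        for (List<String> bundle : bundles) {
            fn.startBundle();
            for (String element : bundle) {
                fn.processElement(element);
            }
            fn.finishBundle();
        }
        fn.teardown();
        return fn.calls;
    }

    public static void main(String[] args) {
        // Two bundles: one with two elements, one empty.
        List<List<String>> bundles = Arrays.asList(
                Arrays.asList("a", "b"),
                Collections.<String>emptyList());
        for (String call : run(bundles)) {
            System.out.println(call);
        }
    }
}
```

Note that the empty bundle still gets a StartBundle/FinishBundle pair, while Setup and Teardown each appear once for the whole instance.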
Both Setup/Teardown and StartBundle/FinishBundle are optional: it is possible to implement any DoFn without using them, doing the work only in ProcessElement, but it will be inefficient. Both pairs of methods allow optimizations:

- Often one wants to batch work between elements, e.g. instead of doing an RPC per element, do an RPC for batches of N elements. StartBundle/FinishBundle tell you what the allowed boundaries of batching are: basically, you are not allowed to batch across bundles. FinishBundle must force a flush of your batch (and StartBundle must initialize / reset the batch). This is the only common use of these methods that I'm aware of, but if you're interested in a more general or rigorous explanation: a bundle is a unit of fault tolerance, and the runner assumes that by the time FinishBundle returns, you have completely performed all the work (outputting elements or performing side effects) associated with all elements seen in this bundle; work must not "leak" between bundles.
- Often one wants to manage long-lived resources, e.g. network connections. You could do this in StartBundle/FinishBundle, but, unlike pending side effects or output, it is fine for such resources to persist between bundles. That's what Setup and Teardown are for.
- Also often one wants to perform costly initialization of a DoFn, e.g. parsing a config file etc. This is also best done in Setup.
More succinctly:
- Manage resources and costly initialization in Setup/Teardown.
- Manage batching of work in StartBundle/FinishBundle.
(Managing resources in bundle methods is inefficient; managing batching in setup/teardown is plain incorrect and will lead to data loss.)
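To make the division of labor concrete, here is a minimal plain-Java sketch of the pattern. It is not Beam SDK code: `BatchingFnSketch`, `FakeClient` and `MAX_BATCH_SIZE` are hypothetical names invented for illustration, with `FakeClient` standing in for whatever library or connection needs initializing. In a real DoFn the methods would be annotated as noted in the comments and invoked by the runner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of "resources in Setup/Teardown, batching in
// StartBundle/FinishBundle"; NOT Beam SDK code.
class BatchingFnSketch {
    // Hypothetical stand-in for an expensive resource (e.g. an RPC client).
    static class FakeClient {
        final List<List<String>> sentBatches = new ArrayList<>();
        boolean open = true;
        void send(List<String> batch) { sentBatches.add(new ArrayList<>(batch)); }
        void close() { open = false; }
    }

    static final int MAX_BATCH_SIZE = 2; // arbitrary value for illustration

    FakeClient client;   // long-lived: may persist across bundles
    List<String> batch;  // per-bundle: must be empty once finishBundle returns

    // Would be @Setup in Beam: costly initialization, once per instance.
    void setup() { client = new FakeClient(); }

    // Would be @StartBundle: initialize / reset the batch.
    void startBundle() { batch = new ArrayList<>(); }

    // Would be @ProcessElement: buffer, and flush when the batch is full.
    void processElement(String element) {
        batch.add(element);
        if (batch.size() >= MAX_BATCH_SIZE) {
            flush();
        }
    }

    // Would be @FinishBundle: force a flush so no work leaks between bundles.
    void finishBundle() { flush(); }

    // Would be @Teardown: release the long-lived resource.
    void teardown() { client.close(); }

    private void flush() {
        if (!batch.isEmpty()) {
            client.send(batch);
            batch.clear();
        }
    }

    public static void main(String[] args) {
        BatchingFnSketch fn = new BatchingFnSketch();
        fn.setup();
        for (List<String> bundle : Arrays.asList(
                Arrays.asList("a", "b", "c"), Arrays.asList("d"))) {
            fn.startBundle();
            for (String element : bundle) {
                fn.processElement(element);
            }
            fn.finishBundle();
        }
        fn.teardown();
        System.out.println(fn.client.sentBatches); // prints [[a, b], [c], [d]]
    }
}
```

The partial batch `[c]` is sent at the end of the first bundle instead of being merged with `d` from the second one. If the flush lived only in `teardown`, a bundle could be reported as finished while its elements were still sitting unsent in the buffer, which is the data-loss case mentioned above.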
The DoFn documentation was recently updated to make this clearer.