What is the difference between DoFn.Setup and DoFn.StartBundle?

Question

DoFn.Setup: Annotation for the method to use to prepare an instance for processing bundles of elements.

Uses the word "bundle", takes zero arguments.

DoFn.StartBundle: Annotation for the method to use to prepare an instance for processing a batch of elements.

Uses the word "batch", takes zero or one argument (StartBundleContext, a way to access PipelineOptions).

I need to initialize a library within the DoFn instance, then use that library for every element in the "batch" or "bundle". I wouldn't normally split hairs with these two words, but in a pipeline, there might be some difference?

Recommended answer

The lifecycle of a DoFn is as follows:

  • Setup
  • Repeatedly process bundles:
    • StartBundle
    • Repeatedly ProcessElement
    • FinishBundle
  • Teardown

    I.e. one instance of a DoFn can process many (zero or more) bundles, and within one bundle, it processes many (zero or more) elements.
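    The call order described above can be illustrated with a small plain-Java simulation. This is only a sketch: TracingFn and LifecycleDemo are hypothetical names, and the loop stands in for a runner rather than using the Beam API. In a real DoFn the same methods would carry the @Setup, @StartBundle, @ProcessElement, @FinishBundle and @Teardown annotations.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for a DoFn (hypothetical, NOT the Beam API).
// Each lifecycle method just records that it was called.
class TracingFn {
    final List<String> trace = new ArrayList<>();

    void setup() { trace.add("setup"); }
    void startBundle() { trace.add("startBundle"); }
    void processElement(String element) { trace.add("process:" + element); }
    void finishBundle() { trace.add("finishBundle"); }
    void teardown() { trace.add("teardown"); }
}

// Toy "runner" (an assumption for illustration): one instance,
// zero or more bundles, zero or more elements per bundle.
class LifecycleDemo {
    static List<String> run(List<List<String>> bundles) {
        TracingFn fn = new TracingFn();
        fn.setup();                          // once per instance
        for (List<String> bundle : bundles) {
            fn.startBundle();                // once per bundle
            for (String element : bundle) {
                fn.processElement(element);  // once per element
            }
            fn.finishBundle();               // once per bundle
        }
        fn.teardown();                       // once per instance
        return fn.trace;
    }

    public static void main(String[] args) {
        // Three bundles; the middle one is empty.
        List<List<String>> bundles =
            List.of(List.of("a", "b"), List.<String>of(), List.of("c"));
        run(bundles).forEach(System.out::println);
    }
}
```

    Note that the empty bundle still gets a startBundle/finishBundle pair, while setup and teardown appear exactly once in the trace.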

    Both Setup/Teardown and StartBundle/FinishBundle are optional: any DoFn can be implemented without them, doing all the work in ProcessElement, but it will be inefficient. Both pairs of methods allow optimizations:

    • Often one wants to batch work between elements, e.g. instead of doing an RPC per element, do an RPC for batches of N elements. StartBundle/FinishBundle tell you what the allowed boundaries of batching are: basically, you are not allowed to batch across FinishBundle. FinishBundle must force a flush of your batch (and StartBundle must initialize / reset the batch). This is the only common use of these methods that I'm aware of, but if you're interested in a more general or rigorous explanation: a bundle is a unit of fault tolerance, and the runner assumes that by the time FinishBundle returns, you have completely performed all the work (outputting elements or performing side effects) associated with all elements seen in this bundle; work must not "leak" between bundles.
    • Often one wants to manage long-lived resources, e.g. network connections. You could do this in StartBundle/FinishBundle, but, unlike pending side effects or output, it is fine for such resources to persist between bundles. That's what Setup and Teardown are for.
    • Also often one wants to perform costly initialization of a DoFn, e.g. parsing a config file etc. This is also best done in Setup.
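
    A minimal sketch of how these three points might fit together, again in plain Java rather than against the Beam API. FakeRpcClient, BatchingFn, BatchingDemo and the batch size of 2 are all assumptions made up for illustration. The client is created once in setup and survives across bundles, while the buffer is reset in startBundle and flushed in finishBundle so that nothing leaks between bundles.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client standing in for a long-lived resource such as a
// network connection; it simply records each batch it is asked to send.
class FakeRpcClient {
    final List<List<String>> calls = new ArrayList<>();
    boolean open = true;

    void send(List<String> batch) { calls.add(new ArrayList<>(batch)); }
    void close() { open = false; }
}

// Plain-Java stand-in for a batching DoFn (NOT the Beam API).
class BatchingFn {
    static final int MAX_BATCH = 2;   // assumed batch size for the sketch
    FakeRpcClient client;
    private List<String> buffer;

    void setup() { client = new FakeRpcClient(); }      // resource persists across bundles
    void startBundle() { buffer = new ArrayList<>(); }  // initialize / reset the batch

    void processElement(String element) {
        buffer.add(element);
        if (buffer.size() >= MAX_BATCH) {
            flush();                                    // one call per full batch
        }
    }

    void finishBundle() { flush(); }                    // force a flush: no work may leak
    void teardown() { client.close(); }                 // release the resource

    private void flush() {
        if (!buffer.isEmpty()) {
            client.send(buffer);
            buffer.clear();
        }
    }
}

// Toy driver (an assumption) that plays the role of the runner.
class BatchingDemo {
    static List<List<String>> run(List<List<String>> bundles) {
        BatchingFn fn = new BatchingFn();
        fn.setup();
        for (List<String> bundle : bundles) {
            fn.startBundle();
            for (String element : bundle) {
                fn.processElement(element);
            }
            fn.finishBundle();
        }
        fn.teardown();
        return fn.client.calls;
    }

    public static void main(String[] args) {
        // Prints [[a, b], [c], [d]]: "c" is flushed at the end of the first
        // bundle instead of being batched together with "d".
        System.out.println(run(List.of(List.of("a", "b", "c"), List.of("d"))));
    }
}
```

    If the flush were moved from finishBundle to teardown, "c" would stay buffered past the end of its bundle, which is the kind of leak the answer warns about.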

    More succinctly:

    • Manage resources and costly initialization in Setup/Teardown.
    • Manage batching of work in StartBundle/FinishBundle.

    (Managing resources in bundle methods is inefficient; managing batching in setup/teardown is plain incorrect and will lead to data loss)

    The DoFn documentation was recently updated to make this more clear.
