What is the difference between DoFn.Setup and DoFn.StartBundle?

Question

DoFn.Setup: Annotation for the method to use to prepare an instance for processing bundles of elements.

Uses the word "bundle", takes zero arguments.

DoFn.StartBundle: Annotation for the method to use to prepare an instance for processing a batch of elements.

Uses the word "batch", takes zero or one argument (StartBundleContext, a way to access PipelineOptions).

I need to initialize a library within the DoFn instance, then use that library for every element in the "batch" or "bundle". I wouldn't normally split hairs over these two words, but in a pipeline, might there be some difference?

Recommended answer

The lifecycle of a DoFn is as follows:

  • Setup
  • Repeatedly process bundles:
    • StartBundle
    • Repeated ProcessElement
    • FinishBundle
  • Teardown

I.e. one instance of a DoFn can process many (zero or more) bundles, and within one bundle, it processes many (zero or more) elements.
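The call order described above can be traced with a small toy model. This is a plain-Java sketch, not Beam's API: `TracingFn` and the `run` loop are hypothetical stand-ins for a DoFn instance and a runner, and the method names only mirror the lifecycle annotations. A real runner may also create several instances and decides bundle boundaries itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the DoFn lifecycle; this is NOT Beam's API, only an illustration. */
public class LifecycleTrace {

    /** Hypothetical stand-in for a DoFn: each method records its own name when called. */
    static class TracingFn {
        final List<String> calls = new ArrayList<>();
        void setup() { calls.add("Setup"); }
        void startBundle() { calls.add("StartBundle"); }
        void processElement(String element) { calls.add("ProcessElement(" + element + ")"); }
        void finishBundle() { calls.add("FinishBundle"); }
        void teardown() { calls.add("Teardown"); }
    }

    /** Hypothetical runner loop: one instance, reused across all the given bundles. */
    static List<String> run(List<List<String>> bundles) {
        TracingFn fn = new TracingFn();
        fn.setup();                          // once per instance
        for (List<String> bundle : bundles) {
            fn.startBundle();                // once per bundle
            for (String element : bundle) {
                fn.processElement(element);  // once per element
            }
            fn.finishBundle();               // once per bundle
        }
        fn.teardown();                       // once per instance
        return fn.calls;
    }

    public static void main(String[] args) {
        // Two bundles: one with two elements, one empty.
        List<String> calls = run(List.of(List.of("a", "b"), List.<String>of()));
        calls.forEach(System.out::println);
    }
}
```

Running `main` prints Setup once, then a StartBundle/FinishBundle pair around each bundle (including the empty one), then Teardown once.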

Both Setup/Teardown and StartBundle/FinishBundle are optional: it is possible to implement any DoFn without them and do all the work in ProcessElement, but that will be inefficient. Both pairs of methods allow optimizations:

  • Often one wants to batch work between elements, e.g. instead of doing an RPC per element, do an RPC for batches of N elements. StartBundle/FinishBundle tell you the allowed boundaries of batching: basically, you are not allowed to batch across FinishBundle. FinishBundle must force a flush of your batch (and StartBundle must initialize / reset the batch). This is the only common use of these methods that I'm aware of, but if you're interested in a more general or rigorous explanation: a bundle is a unit of fault tolerance, and the runner assumes that by the time FinishBundle returns, you have completely performed all the work (outputting elements or performing side effects) associated with all elements seen in this bundle; work must not "leak" between bundles.
  • Often one wants to manage long-lived resources, e.g. network connections. You could do this in StartBundle/FinishBundle, but, unlike pending side effects or output, it is fine for such resources to persist between bundles. That's what Setup and Teardown are for.
  • Often one also wants to perform costly initialization of a DoFn, e.g. parsing a config file. This is also best done in Setup.
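A minimal sketch of that division of labor, assuming a made-up client and batch size. It is plain Java rather than real Beam code: `FakeRpcClient`, `BatchingFn`, `MAX_BATCH_SIZE` and the `run` driver are all hypothetical, and the comments name the Beam annotation each method would carry in an actual DoFn.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the batching pattern; the Beam annotations are named in comments only. */
public class BatchingSketch {

    /** Hypothetical client standing in for a long-lived resource such as a network connection. */
    static class FakeRpcClient {
        final List<List<String>> sentBatches = new ArrayList<>();
        boolean open = true;
        void send(List<String> batch) { sentBatches.add(new ArrayList<>(batch)); }
        void close() { open = false; }
    }

    /** Shaped like a DoFn: in Beam each method would carry the annotation noted beside it. */
    static class BatchingFn {
        static final int MAX_BATCH_SIZE = 3; // illustrative value, not from the source
        FakeRpcClient client;                // lives across bundles
        List<String> batch;                  // lives within one bundle only

        void setup() { client = new FakeRpcClient(); }     // @Setup
        void startBundle() { batch = new ArrayList<>(); }  // @StartBundle
        void processElement(String element) {              // @ProcessElement
            batch.add(element);
            if (batch.size() >= MAX_BATCH_SIZE) {
                flush();
            }
        }
        void finishBundle() { flush(); }                   // @FinishBundle: nothing may leak past here
        void teardown() { client.close(); }                // @Teardown

        private void flush() {
            if (!batch.isEmpty()) {
                client.send(batch);
                batch.clear();
            }
        }
    }

    /** Drives one instance over the given bundles and returns the batches that were sent. */
    static List<List<String>> run(List<List<String>> bundles) {
        BatchingFn fn = new BatchingFn();
        fn.setup();
        for (List<String> bundle : bundles) {
            fn.startBundle();
            bundle.forEach(fn::processElement);
            fn.finishBundle();
        }
        FakeRpcClient client = fn.client;
        fn.teardown();
        return client.sentBatches;
    }

    public static void main(String[] args) {
        // Bundle 1 has four elements, bundle 2 has two.
        System.out.println(run(List.of(List.of("a", "b", "c", "d"), List.of("e", "f"))));
        // prints [[a, b, c], [d], [e, f]]
    }
}
```

The leftover element "d" is sent when the first bundle finishes instead of being carried into the second bundle, while the client created in `setup` is reused for both bundles.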

More concisely:

  • Manage resources and costly initialization in Setup/Teardown.
  • Manage batching of work in StartBundle/FinishBundle.

(Managing resources in the bundle methods is inefficient; managing batching in Setup/Teardown is plainly incorrect and will lead to data loss.)

The DoFn documentation was recently updated to make this clearer.

10-25 07:06