问题描述
Apache Beam 最近通过 StateSpec
和 @StateId
注释引入了状态单元,部分支持 Apache Flink 和 Google Cloud Dataflow.
Apache Beam has recently introduced state cells, through StateSpec
and the @StateId
annotation, with partial support in Apache Flink and Google Cloud Dataflow.
我的问题是关于状态垃圾收集的,在有状态 DoFn 用于窗口流的情况下.通常,当窗口到期(即水印通过窗口末尾)时,运行程序会删除(垃圾收集)状态.但是,考虑一下窗口窗格被提前触发,而被触发的窗格被丢弃的情况:
My question is about state garbage collection, in the case where a stateful DoFn is used on a windowed stream. Typically, state is removed (garbage collected) by the runner when the window expires (i.e. the watermark passes the end of the window). However, consider the case where window panes are triggered early, and the fired panes are discarded:
input.apply(Window.<MyElement>into(CalendarWindows.days(1))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(10))
))
.discardingFiredPanes()
.apply(ParDo.of(new MyStatefulDofn()));
在这种情况下,早期触发的键的状态是否会保留到窗口到期后?即同一窗口中的后续窗格是否可以访问由早期窗格写入的状态?
In this case, would the state for the keys which were fired early be kept until after the window expires? i.e. would subsequent panes in the same window have access to state written by earlier panes?
推荐答案
您的触发配置不会影响 ParDo
的有状态处理的进行方式.元素会立即提供给您的 DoFn
,无需任何缓冲/触发,您的 DoFn
直接控制何时发生输出.
Your triggering configuration does not affect how stateful processing of a ParDo
proceeds. The elements are provided immediately to your DoFn
without any buffering/triggering and your DoFn
directly controls when output occurs.
您控制输出这一事实是有状态的 ParDo
处理和由触发器控制的 Combine.perKey
之间的重要区别.这就是为什么当触发器对于您的用例来说不够丰富时,有状态的 ParDo
通常是一个不错的选择.
The fact that you control the output is an important difference between stateful ParDo
processing and Combine.perKey
governed by triggers. This is why stateful ParDo
is often a good choice when triggers are not rich enough for your use case.
我在 Beam 博客上的帖子中更详细地比较了有状态 ParDo
处理与 Combine
+ 触发器:https://beam.apache.org/blog/2017/02/13/stateful-processing.html
I compare stateful ParDo
processing with Combine
+ triggers in some more detail in my post on the Beam blog: https://beam.apache.org/blog/2017/02/13/stateful-processing.html
现在,如果在有状态的 ParDo
上游某处有 GroupByKey
或 Combine.perKey
,那么输入元素将与一些相关联触发从上游发射.但这并不影响您的有状态 ParDo
的状态是如何管理的.由于状态是跨元素持久化的,而窗格"只是一个元素,状态会一直保持到窗口完全过期.
Now, if there is a GroupByKey
or Combine.perKey
somewhere upstream from your stateful ParDo
, then input elements will be associated with some trigger firing from upstream. But this does not affect how the state for your stateful ParDo
is managed. As state is persisted across elements, and a "pane" is just an element, state is maintained until the window expires fully.
顺便提一下,非常好的总结导致了您的问题!
Very nice summary leading up to your question, by the way!
这篇关于Beam 中的状态处理 - 状态是否在窗格之间共享?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!