Problem description
I have a PCollection in GCP Dataflow/Apache Beam. Instead of processing its elements one by one, I need to combine them "by N", something like grouped(N). So, for bounded processing, it would group the elements into batches of 10, with the last batch holding whatever is left. Is this possible in Apache Beam?
Recommended answer
Edit: this looks like the same question as: Google Dataflow "elementCountExact" aggregation
You should be able to do something similar by assigning the elements to the global window and using AfterPane.elementCountAtLeast(N). You still need to account for what happens if there aren't enough elements to fire the trigger. You could use this:
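Repeatedly.forever(AfterFirst.of(
    // fire once at least N elements have arrived in the pane
    AfterPane.elementCountAtLeast(N),
    // or X minutes (processing time) after the first element in the pane
    AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(X))))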
But you should first ask yourself why you need this heuristic; there is probably a more idiomatic way to solve your problem. Read about Data-Driven Triggers in Beam's programming guide.
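To make the behaviour asked for in the question concrete, here is a minimal plain-Java sketch of "grouped(N)" semantics: buffer elements, emit a batch each time N is reached, and flush whatever is left at the end of the input. It deliberately uses no Beam classes; `BatchingSketch` and `grouped` are hypothetical names introduced only for illustration. Note that AfterPane.elementCountAtLeast(N), as its name says, promises only at least N elements per pane, so the trigger above approximates this behaviour rather than reproducing it exactly.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchingSketch {

    // Hypothetical helper (not a Beam API): splits the input into
    // consecutive batches of exactly n elements, plus one final
    // smaller batch if the input size is not a multiple of n.
    static <T> List<List<T>> grouped(Iterable<T> items, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>(n);
        for (T item : items) {
            current.add(item);
            if (current.size() == n) {
                // Batch is full: emit it and start a new buffer.
                batches.add(current);
                current = new ArrayList<>(n);
            }
        }
        if (!current.isEmpty()) {
            // End of bounded input: flush the partial last batch.
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        System.out.println(grouped(List.of(1, 2, 3, 4, 5, 6, 7), 3));
        // prints [[1, 2, 3], [4, 5, 6], [7]]
    }
}
```

In a distributed pipeline there is no single in-order loop like this, which is why the answer reaches for a count-based trigger with a processing-time fallback instead.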