问题描述
我在GCP Dataflow/Apache Beam中有一个PCollection.而不是一个接一个地处理它,我需要组合"by N". grouped(N)
之类的东西.因此,在有限处理的情况下,它将按批处理将10个项目分组,最后剩下的则按最后一批进行分组.在Apache Beam中有可能吗?
I have a PCollection in GCP Dataflow/Apache Beam. Instead of processing it one by one, I need to combine "by N". Something like grouped(N)
. So, in case of bounded processing, it will group by 10 items in batch and last batch with whatever left. Is this possible in Apache Beam?
推荐答案
编辑,如下所示: Google Dataflow "elementCountExact";聚合
通过将元素分配给全局窗口并使用AfterPane.elementCountAtLeast(N)
,您应该能够执行类似的操作.您仍然需要考虑如果没有足够的元素触发触发器该怎么办.您可以使用:
You should be able to do something similar by assigning elements to global window and using AfterPane.elementCountAtLeast(N)
. You still need to account for what what if there isn’t enough elements to fire the trigger. You could use this:
Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(N),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(X))))
但是您首先应该问自己为什么需要这种启发式方法,可能有更多的idomatice方法可以解决您的问题.在 Beam的编程指南
But you should ask yourself why do you need this heuristic in the first place, there probably is more idomatice way to solve your problem. Read about Data-Driven Triggers
in Beam’s programming guide
这篇关于Beam/Dataflow中的批次PCollection的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!