本文介绍了Beam/Dataflow中的批次PCollection的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在GCP Dataflow/Apache Beam中有一个PCollection.而不是一个接一个地处理它,我需要组合"by N". grouped(N)之类的东西.因此,在有限处理的情况下,它将按批处理将10个项目分组,最后剩下的则按最后一批进行分组.在Apache Beam中有可能吗?

I have a PCollection in GCP Dataflow/Apache Beam. Instead of processing it one by one, I need to combine "by N". Something like grouped(N). So, in case of bounded processing, it will group by 10 items in batch and last batch with whatever left. Is this possible in Apache Beam?

推荐答案

编辑,如下所示: Google Dataflow "elementCountExact";聚合

通过将元素分配给全局窗口并使用AfterPane.elementCountAtLeast(N),您应该能够执行类似的操作.您仍然需要考虑如果没有足够的元素触发触发器该怎么办.您可以使用:

You should be able to do something similar by assigning elements to global window and using AfterPane.elementCountAtLeast(N). You still need to account for what what if there isn’t enough elements to fire the trigger. You could use this:

 Repeatedly.forever(AfterFirst.of(
  AfterPane.elementCountAtLeast(N),
  AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(X))))

但是您首先应该问自己为什么需要这种启发式方法,可能有更多的idomatice方法可以解决您的问题.在 Beam的编程指南

But you should ask yourself why do you need this heuristic in the first place, there probably is more idomatice way to solve your problem. Read about Data-Driven Triggers in Beam’s programming guide

这篇关于Beam/Dataflow中的批次PCollection的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-23 15:32