
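Problem description

I wonder whether Apache Beam / Google Dataflow is smart enough to recognize repeated transformations in the dataflow graph and run them only once. For example, suppose I have two branches:

  • p | GroupByKey() | FlatMap(...)
  • p | combiners.Top.PerKey(...) | FlatMap(...)

Both group elements by key under the hood. Will the execution engine recognize that GroupByKey() has the same input in both cases and run it only once? Or do I need to manually make sure that a single GroupByKey() precedes all the branches that use it?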

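Solution

As you may have inferred, this behavior is runner-dependent: each runner implements its own optimization logic.

  • The Dataflow Runner does not currently support this optimization.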


