Question
I wonder if Apache Beam / Google Dataflow is smart enough to recognize repeated transformations in the dataflow graph and run them only once. For example, if I have 2 branches:
- p | GroupByKey() | FlatMap(...)
- p | combiners.Top.PerKey(...) | FlatMap(...)
Both involve grouping elements by key under the hood. Will the execution engine recognize that GroupByKey() has the same input in both cases and run it only once? Or do I need to manually ensure that GroupByKey() precedes all the branches where it is used?
Recommended answer
As you may have inferred, this behavior is runner-dependent. Each runner implements its own optimization logic.
- The Dataflow Runner does not currently support this optimization.
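Since the runner will not merge the two groupings for you, one option is to group once by hand and feed both branches from the same grouped PCollection. The following is only a minimal sketch under stated assumptions: the sample input, the helper names `expand_group` and `top_n`, and the step labels are all hypothetical, and the second branch replaces `combiners.Top.PerKey(...)` with a plain `heapq.nlargest` over the already-grouped values rather than using the combiner itself.

```python
import heapq


def expand_group(kv):
    """Branch 1 (hypothetical): emit one (key, value) pair per grouped value."""
    key, values = kv
    for value in values:
        yield key, value


def top_n(kv, n=2):
    """Branch 2 (hypothetical): keep the n largest values of an already-grouped key."""
    key, values = kv
    return key, heapq.nlargest(n, values)


def run():
    # Imported here so the helpers above stay usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        # Assumed sample input; a real pipeline would read from a source.
        p = pipeline | beam.Create([("a", 3), ("a", 1), ("a", 2), ("b", 5)])

        # Group once; both branches consume this single PCollection.
        grouped = p | "GroupOnce" >> beam.GroupByKey()

        branch1 = grouped | "Expand" >> beam.FlatMap(expand_group)
        branch2 = grouped | "TopPerKey" >> beam.Map(top_n, n=2)

        branch1 | "PrintExpanded" >> beam.Map(print)
        branch2 | "PrintTop" >> beam.Map(print)
```

Calling `run()` executes the pipeline on the default local runner. Note the trade-off: `Top.PerKey` is built on a per-key combine, and runners may partially combine values before the shuffle, so computing the top values from a shared `GroupByKey()` output can move more data than the combiner would. Whether a single shared grouping is actually cheaper depends on the data and the runner, so it is worth measuring both variants.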