Coalesce reduces the parallelism of the JDBC read


Problem description

I use the Spark JDBC feature as follows:

  • Read MySQL tables into a DataFrame
  • Transform them
  • Coalesce them
  • Write them to HDFS

No action is performed on the DataFrame at any point in its lifetime. This used to work as expected, but lately I've been running into a problem: because of Spark's lazy evaluation, the coalesce ends up reducing the parallelism of the read operation.

So if I read the DataFrame with DataFrameReader.jdbc(..numPartitions..) and numPartitions=42, and then coalesce it to 6 partitions before writing, it reads the DataFrame with a concurrency of only 6 (only 6 queries are fired at MySQL). I'd like to get back the earlier behaviour: read with a parallelism of 42, and only then coalesce.
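
For reference, a minimal Scala sketch of the setup described above, using the 42/6 partition counts from the question; the JDBC URL, credentials, table name, partition column and bounds, the filter transformation, and the output path are all placeholders, not details from the question:

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("mysql-to-hdfs").getOrCreate()

val props = new Properties()
props.setProperty("user", "user")          // placeholder credentials
props.setProperty("password", "password")

// Read the MySQL table with numPartitions = 42 (up to 42 concurrent queries)
val df = spark.read.jdbc(
  "jdbc:mysql://host:3306/db",   // placeholder JDBC URL
  "some_table",                  // placeholder table name
  "id",                          // placeholder numeric partition column
  0L,                            // lowerBound (placeholder)
  1000000L,                      // upperBound (placeholder)
  42,                            // numPartitions
  props
)

// Transform (stand-in for the real transformations), coalesce to 6, write to HDFS
val transformed = df.filter(col("id") > 0)

transformed
  .coalesce(6)                   // this is what drags the JDBC read down to 6 tasks
  .write
  .parquet("hdfs:///path/out")   // placeholder output path
```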

I recently migrated to Spark 2.3.0 on EMR 5.13; could this be related? Is there a workaround?

Recommended answer

It has nothing to do with laziness. coalesce intentionally doesn't create an analysis barrier: it merges existing partitions without a shuffle, so the requested partition count propagates back up the plan to the JDBC scan, which then runs with only 6 tasks.
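
One way to see the difference is to compare the physical plans; a small sketch, assuming `transformed` is the DataFrame from the sketch above:

```scala
// coalesce(6): no Exchange appears in the plan, so the 6 partitions reach all
// the way back to the JDBC scan and only 6 queries run against MySQL.
transformed.coalesce(6).explain()

// repartition(6): an Exchange (shuffle) appears, so the scan keeps its 42 tasks
// and only the data after the shuffle is held in 6 partitions.
transformed.repartition(6).explain()
```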

So just follow the documentation and use repartition instead of coalesce.
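
A minimal sketch of the change, reusing the placeholder names from the earlier sketch:

```scala
// Swap coalesce(6) for repartition(6) before the write; the shuffle acts as a
// stage boundary, so the JDBC read runs with 42 parallel tasks and only the
// write uses 6 partitions (at the cost of shuffling the data once).
transformed
  .repartition(6)
  .write
  .parquet("hdfs:///path/out")   // placeholder output path, as above
```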
