Question
I am hoping to generate an explain/execution plan in Spark 2.2 with some actions on a DataFrame. The goal here is to ensure that partition pruning is occurring as expected before I kick off the job and consume cluster resources. I tried a Spark documentation search and an SO search, but couldn't find syntax that worked for my situation.
Here is a simple example that works as expected:
scala> List(1, 2, 3, 4).toDF.explain
== Physical Plan ==
LocalTableScan [value#42]
Here's an example that does not work as expected, but that I am hoping to get working:
scala> List(1, 2, 3, 4).toDF.count.explain
<console>:24: error: value explain is not a member of Long
List(1, 2, 3, 4).toDF.count.explain
^
And here's a more detailed example to further illustrate the end goal: the partition pruning that I am hoping to confirm via an explain plan.
val newDf = spark.read.parquet(inputPath).filter(s"start >= ${startDt}").filter(s"start <= ${endDt}")
Thanks in advance for any thoughts/feedback.
Answer
The count method is eagerly evaluated and, as you can see, returns a Long, so there is no execution plan available.
You have to use a lazy transformation, either:
import org.apache.spark.sql.functions.count
df.select(count($"*"))
or
df.groupBy().agg(count($"*"))
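Both options are lazy transformations, so explain can be called on the result before any job runs. Here is a minimal spark-shell sketch (spark.implicits._ is already in scope in the shell), reusing the toy DataFrame from the question:

import org.apache.spark.sql.functions.count

val df = List(1, 2, 3, 4).toDF

// Lazy: nothing executes here, so an execution plan is available.
df.select(count($"*")).explain
// The physical plan should end in a HashAggregate over an
// Exchange SinglePartition, with a partial_count underneath.

// To retrieve the actual value afterwards, trigger the action:
val total: Long = df.select(count($"*")).first.getLong(0)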
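Applied to the partition-pruning goal from the question: the filtered read is still a DataFrame (no action has run), so explain can be called on it directly. A sketch under assumed names (a hypothetical inputPath pointing at data partitioned by start, and hypothetical date-string bounds, which need quoting inside the SQL expression):

val inputPath = "/data/events"   // hypothetical path, partitioned by `start`
val startDt = "2017-01-01"       // hypothetical bounds
val endDt = "2017-01-31"

val newDf = spark.read
  .parquet(inputPath)
  .filter(s"start >= '${startDt}'")
  .filter(s"start <= '${endDt}'")

// Still lazy: explain(true) prints the parsed, analyzed, optimized and
// physical plans without running a job. If pruning takes effect, the
// FileScan node's PartitionFilters entry should list the start >= ...
// and start <= ... predicates rather than leaving them as a post-scan Filter.
newDf.explain(true)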