Why does a Spark query run faster the second time it is executed?

Problem description

The second time I run a query, it's significantly faster. Why?

Code:

public void test3() {
    Dataset<Row> sqlDF = spark.read().json("src/main/resources/data/ipl.json");
    // Note: repartition() returns a new Dataset; the result is discarded here.
    sqlDF.repartition(2);
    Dataset<Row> result1 = sqlDF.where("run > 10000").select("team", "run");
    // Dataset<Row> cachedPartition = result1.cache();
    result1.collect();
    // result1.show();
    log.info("PhysicalPlan\n" + result1.queryExecution().executedPlan());

    Dataset<Row> result2 = sqlDF.where("run > 10000").select("team", "run");
    result2.collect();
    // result2.show();
    log.info("PhysicalPlan\n" + result2.queryExecution().executedPlan());
}

Physical plan:

Execution time in the Spark UI:

Why do these queries take different amounts of time, and why is the difference in execution time so large? Is caching happening under the hood? If so, why is it not mentioned in the physical plan?

Recommended answer

You're pointing Spark at a file. The second time you access the same file, it will be read faster.

It's the same situation if you run the following code twice (except that Scala runs on the JVM and uses java.nio and java.io, of course).

with open("src/main/resources/data/ipl.json") as f:
    t = f.read()
print(t)

The first time, the I/O operation has to be initialized. The second time, it can reuse parts of the previous run. If the file is small (as it seems to be in your case), the whole file will have been cached.
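
If you want to check this on your own machine, here is a rough sketch that times consecutive reads of the same file. The helper name `time_reads` and the temporary stand-in file are assumptions made for illustration; they are not part of the question. Keep in mind that a file that was just written is probably already in the operating system's file cache, so the gap only shows up clearly on a file that has not been read recently, and the numbers will vary from machine to machine.

```python
import os
import tempfile
import time


def time_reads(path, repeats=2):
    """Read the whole file `repeats` times; return the elapsed seconds of each read."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as f:
            f.read()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Hypothetical stand-in data file; replace sample_path with a real path
    # (for example the JSON file from the question) to time your own data.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".json") as tmp:
        tmp.write(b'{"team": "x", "run": 1}\n' * 100_000)
        sample_path = tmp.name
    try:
        for i, seconds in enumerate(time_reads(sample_path), start=1):
            print(f"read {i}: {seconds * 1000:.3f} ms")
    finally:
        os.remove(sample_path)
```

This only measures plain file I/O. It says nothing about Spark's own start-up costs, so treat it as an illustration of the file-level effect described above rather than a full explanation of the timings in the Spark UI.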
