Problem description
I have an iterative application running on Spark that I simplified to the following code:
var anRDD: org.apache.spark.rdd.RDD[Int] = sc.parallelize((0 to 1000))
var c: Long = Int.MaxValue
var iteration: Int = 0
while (c > 0) {
  iteration += 1
  // Manipulate the RDD and cache the new RDD
  anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache() //.localCheckpoint()
  // Actually compute the RDD and spawn a new job
  c = anRDD.count()
  println(s"Iteration: $iteration, Values: $c")
}
What happens to the memory allocation across subsequent jobs?
- Does the current anRDD "override" the previous ones, or are they all kept in memory? In the long run, this could throw a memory exception.
- Do localCheckpoint and cache have different behaviors? If localCheckpoint is used in place of cache, since localCheckpoint truncates the RDD lineage, I would expect the previous RDDs to be overridden.
Recommended answer
Unfortunately, it seems that Spark is not a good fit for things like this.
Your original implementation is not viable, because on each iteration the newer RDD holds an internal reference to the older one, so all of the RDDs pile up in memory.
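A common mitigation, sketched below, is to keep a handle on the previous RDD and unpersist() it once the new one has been materialized. This sketch assumes sc is the usual spark-shell SparkContext, and the previous variable is illustrative, not part of the original code:

var anRDD = sc.parallelize(0 to 1000)
var c: Long = Int.MaxValue
var iteration = 0
while (c > 0) {
  iteration += 1
  val previous = anRDD                  // keep a handle on the old RDD
  anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache()
  c = anRDD.count()                     // materialize the new RDD first
  previous.unpersist()                  // then release the old cached blocks
  println(s"Iteration: $iteration, Values: $c")
}

Note that unpersist() only frees the cached blocks; the lineage chain still grows with every iteration, which is why the options below come into play.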
localCheckpoint is an approximation of what you are trying to achieve. It does truncate the RDD's lineage, but you lose fault tolerance; this is clearly stated in the documentation for the method.
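A minimal sketch of that variant, assuming the same loop as in the question: replacing cache() with localCheckpoint() cuts the lineage at each iteration, at the cost of losing the ability to recompute lost partitions.

// Inside the loop: localCheckpoint() truncates lineage, but the data lives
// only on the executors, so a lost executor makes the RDD unrecoverable.
anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).localCheckpoint()
c = anRDD.count() // the action materializes the local checkpoint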
checkpoint is also an option. It is safe, but it dumps the data to HDFS on each iteration.
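A minimal sketch of the reliable-checkpoint variant; the checkpoint directory below is a hypothetical path, and the cache() call follows Spark's recommendation to persist an RDD before checkpointing so it is not recomputed when written out.

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // hypothetical directory

var anRDD = sc.parallelize(0 to 1000)
var c: Long = Int.MaxValue
var iteration = 0
while (c > 0) {
  iteration += 1
  anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache()
  anRDD.checkpoint() // mark for checkpointing before the first action
  c = anRDD.count()  // runs the job, then writes the checkpoint to HDFS
  println(s"Iteration: $iteration, Values: $c")
}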
Consider redesigning the approach. Hacks like this can bite sooner or later.
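As one possible redesign (not from the original answer), assuming the per-iteration data is small enough for the driver, the iterative step can run locally so that no chain of RDDs is built at all:

// Hypothetical redesign: collect once, then iterate on the driver.
var values = sc.parallelize(0 to 1000).collect().toSeq
var iteration = 0
while (values.nonEmpty) {
  iteration += 1
  values = values.zipWithIndex.collect { case (v, i) if i % 2 == 1 => v }
  println(s"Iteration: $iteration, Values: ${values.length}")
}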