Question
Does each RDD point to the same lineage graph?
Or, when a parent RDD gives its lineage to a new RDD, is the lineage graph also copied by the child, so that the parent and the child have different graphs? In that case, isn't it memory intensive?
Recommended answer
Each RDD maintains a pointer to one or more parents, along with metadata about the type of relationship it has with each parent. For example, when we call val b = a.map() on an RDD a, the RDD b just keeps a reference to its parent a (it never copies it). That chain of references is the lineage.
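The reference-not-copy behaviour can be checked directly through the public dependencies field of an RDD. The following is a minimal sketch, assuming a local Spark installation; the object name, app name, and local[2] master are illustrative choices and not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageReference {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-reference")
    val sc = new SparkContext(conf)
    try {
      val a = sc.parallelize(1 to 10)
      val b = a.map(_ * 2)
      val c = a.filter(_ % 2 == 0)

      // `eq` tests reference identity: each child's single dependency
      // should point at the very same object as `a`, not at a copy of it.
      println(b.dependencies.head.rdd eq a)
      println(c.dependencies.head.rdd eq a)
    } finally {
      sc.stop()
    }
  }
}
```

Both checks are expected to print true: b and c share the one parent object a, so building many children from the same parent adds only one small dependency object per child.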
When the driver submits a job, the RDD graph is serialized and sent to the worker nodes, so that each worker applies the series of transformations (map, filter, and so on) to different partitions. The same lineage is also used to recompute the data if a failure occurs.
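To make the recomputation idea concrete, here is a toy model in plain Scala with no Spark dependency. ToyRDD, SourceRDD, and MappedRDD are hypothetical names invented for this sketch and are not Spark's real classes; the point is only that a node which holds a parent reference plus a function can rebuild its data on demand.

```scala
// Toy model, NOT Spark's real classes: each derived node keeps a reference
// to its parent plus the function that derives its data from the parent's data.
sealed trait ToyRDD[T] {
  def compute(): Seq[T]
}

final class SourceRDD[T](data: Seq[T]) extends ToyRDD[T] {
  def compute(): Seq[T] = data
}

final class MappedRDD[A, B](val parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
  // Nothing is stored here; the result is rebuilt from the parent on every call.
  def compute(): Seq[B] = parent.compute().map(f)
}

object ToyLineage {
  def main(args: Array[String]): Unit = {
    val a = new SourceRDD(Seq(1, 2, 3))
    val b = new MappedRDD[Int, Int](a, _ + 1)
    val c = new MappedRDD[Int, Int](b, _ * 10)

    println(c.parent eq b) // true: c references b, it does not copy it
    println(b.parent eq a) // true
    println(c.compute())   // List(20, 30, 40)
    println(c.compute())   // same result again, recomputed from the lineage
  }
}
```

Real RDDs work per partition and distinguish narrow from shuffle dependencies, which this sketch leaves out.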
To display the lineage of an RDD, Spark provides a debugging method, toDebugString (declared without parentheses in the Scala API).
Consider the following example:
val input = sc.textFile("log.txt")
val splitedLines = input.map(line => line.split(" "))
                        .map(words => (words(0), 1))
                        .reduceByKey{(a,b) => a + b}
Calling toDebugString on the splitedLines RDD outputs the following:
(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
 +-(2) MapPartitionsRDD[5] at map at <console>:24 []
    |  MapPartitionsRDD[4] at map at <console>:23 []
    |  log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
    |  log.txt HadoopRDD[0] at textFile at <console>:21 []
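The same tree can be recovered by following the parent pointers yourself. The sketch below walks rdd.dependencies recursively; the lineage helper, the object name, and the in-memory sample data are assumptions made for illustration (the original example reads log.txt instead), and the exact RDD ids depend on the session.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WalkLineage {
  // Hypothetical helper: returns one line per RDD, parents indented under children.
  def lineage(rdd: RDD[_], depth: Int = 0): Seq[String] = {
    val self = ("  " * depth) + s"${rdd.getClass.getSimpleName}[${rdd.id}]"
    self +: rdd.dependencies.flatMap(dep => lineage(dep.rdd, depth + 1))
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("walk-lineage")
    val sc = new SparkContext(conf)
    try {
      // In-memory stand-in for log.txt so the sketch needs no input file.
      val counts = sc.parallelize(Seq("a b", "a c", "b d"))
        .map(line => line.split(" "))
        .map(words => (words(0), 1))
        .reduceByKey(_ + _)

      lineage(counts).foreach(println)
      println(counts.toDebugString) // Spark's own rendering, for comparison
    } finally {
      sc.stop()
    }
  }
}
```

Because every step only follows an existing reference, printing the lineage never triggers a job and never duplicates the graph.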
For more information about how Spark works internally, please read my other post.