When is an RDD lineage created? How do I find the lineage graph?

Question

I am learning Apache Spark and trying to get the lineage graph of RDDs, but I could not find out when a particular lineage is created. Also, where can I find the lineage of an RDD?

Recommended answer

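RDD lineage is the logical execution plan of a distributed computation. It is created and extended every time you apply a transformation to any RDD.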

Note that this plan is "logical", not "physical": physical execution happens only after you have executed an action.

Quoting the Mastering Apache Spark 2 gitbook:

A RDD lineage graph is hence a graph of what transformations need to be executed after an action has been called.
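
The difference between building the lineage and executing it can be observed with a small standalone program. This is only a sketch, assuming a local Spark installation on the classpath; the object name `LazyLineageDemo`, the `local[2]` master, and the accumulator name are illustrative choices, not anything from the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyLineageDemo {
  // Returns (map calls seen before the action, map calls seen after it).
  def run(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("map calls")

    // Transformation only: this extends the lineage but runs no job.
    val doubled = sc.parallelize(0 to 9).map { n => calls.add(1L); n * 2 }

    // The lineage can already be printed, although nothing has executed yet.
    println(doubled.toDebugString)
    val before = calls.value.longValue()

    // Action: the logical plan is now turned into a job and executed.
    doubled.count()
    val after = calls.value.longValue()

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads, for demonstration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-lineage-demo")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

On a local run without task retries, `run` should return `(0, 10)`: the map function is not invoked while the lineage is being built, and it is invoked once per element when `count()` triggers execution.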

Any RDD has an RDD lineage, even if that lineage is just a single node, i.e. the RDD itself. That's because an RDD may or may not be the result of a series of transformations (and no transformations at all is a "zero-effect" transformation :))
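
The single-node case can be checked through `RDD.dependencies`, which lists an RDD's direct parents. The following is a minimal sketch under the same assumptions as any local Spark program (Spark on the classpath, a `local[2]` master); the object name `SingleNodeLineage` is made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SingleNodeLineage {
  def main(args: Array[String]): Unit = {
    // Assumption: local master, used only so the sketch is self-contained.
    val conf = new SparkConf().setMaster("local[2]").setAppName("single-node-lineage")
    val sc = new SparkContext(conf)
    try {
      val nums = sc.parallelize(0 to 9)
      // No transformation applied: the lineage is the RDD itself, with no parents.
      println(nums.dependencies.isEmpty)

      val doubles = nums.map(_ * 2)
      // One transformation adds exactly one parent link, pointing back to nums.
      println(doubles.dependencies.size)
      println(doubles.dependencies.head.rdd eq nums)
    } finally sc.stop()
  }
}
```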

You can inspect an RDD's lineage using RDD.toDebugString:

val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[0] at parallelize at <console>:24 []

val doubles = nums.map(_ * 2)
scala> doubles.toDebugString
res1: String =
(8) MapPartitionsRDD[1] at map at <console>:25 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:24 []

val groups = doubles.groupBy(_ < 10)
scala> groups.toDebugString
res2: String =
(8) ShuffledRDD[3] at groupBy at <console>:25 []
 +-(8) MapPartitionsRDD[2] at groupBy at <console>:25 []
    |  MapPartitionsRDD[1] at map at <console>:25 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []
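
If the lineage is needed as data rather than as a formatted string, it can also be walked through `RDD.dependencies`. The sketch below is one possible way to do that and is not how `toDebugString` is implemented; the object `LineageWalk`, the helper `describe`, and its edge markers are hypothetical names chosen to loosely resemble the output above.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageWalk {
  // Depth-first walk over rdd.dependencies, one output line per RDD.
  // "+- " marks a parent reached across a shuffle, "|  " a narrow parent.
  def describe(rdd: RDD[_], viaShuffle: Boolean = false, depth: Int = 0): Seq[String] = {
    val edge = if (depth == 0) "" else if (viaShuffle) "+- " else "|  "
    val self = ("   " * depth) + edge + s"${rdd.getClass.getSimpleName}[${rdd.id}]"
    self +: rdd.dependencies.flatMap { dep =>
      describe(dep.rdd, dep.isInstanceOf[ShuffleDependency[_, _, _]], depth + 1)
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master, for demonstration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-walk")
    val sc = new SparkContext(conf)
    try {
      val groups = sc.parallelize(0 to 9).map(_ * 2).groupBy(_ < 10)
      describe(groups).foreach(println)
    } finally sc.stop()
  }
}
```

For the `groups` RDD this yields four lines, starting at the `ShuffledRDD` and ending at the `ParallelCollectionRDD`, with the shuffle boundary marked on the first parent.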

11-03 11:41