I am trying to understand the effect of giving different values of numSlices to the parallelize() method in SparkContext. Given below is the syntax of the method:

def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
(implicit arg0: ClassTag[T]): RDD[T]


I ran spark-shell in local mode:

spark-shell --master local
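
Since numSlices defaults to defaultParallelism when omitted, a quick check in this shell should look roughly like the following; the values shown are an assumption for --master local, where a single local core is used, not copied from a real run:

scala> sc.defaultParallelism
res9: Int = 1

scala> sc.parallelize(1 to 9).partitions.size
res10: Int = 1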


My understanding is that numSlices determines the number of partitions of the RDD returned by sc.parallelize(). Consider the following examples.

Case 1

scala> sc.parallelize(1 to 9, 1);
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> res0.partitions.size
res2: Int = 1


Case 2

scala> sc.parallelize(1 to 9, 2);
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> res3.partitions.size
res4: Int = 2


Case 3

scala> sc.parallelize(1 to 9, 3);
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
scala> res3.partitions.size
res6: Int = 2


Case 4

scala> sc.parallelize(1 to 9, 4);
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22
scala> res3.partitions.size
res8: Int = 2


Question 1: In Case 3 and Case 4 I expected the partition sizes to be 3 and 4 respectively, but in both cases the size is only 2. What is the reason for this?

Question 2: In each case there is a number attached to ParallelCollectionRDD[no]: ParallelCollectionRDD[0] in Case 1, ParallelCollectionRDD[1] in Case 2, and so on. What exactly do these numbers represent?

Best Answer

Question 1: It is a typo on your part. You are calling res3.partitions.size in both cases, instead of res5 and res7 respectively. When I run it with the right variables, it works as expected.
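
For reference, here is a sketch of what the corrected calls should print in the asker's session (the res numbers and values are what I would expect, not copied from a real run):

scala> res5.partitions.size
res9: Int = 3

scala> res7.partitions.size
res10: Int = 4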

Question 2: That is the ID of the RDD within the SparkContext, used to keep the lineage graph straight. Look at what happens when I run the same command three times:

scala> sc.parallelize(1 to 9,1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22


Now there are three different RDDs with three different IDs. We can run the following to check:

scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
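
These IDs also show up in the lineage string that toDebugString prints, which is one way to see how Spark uses them to track the graph. A minimal sketch (the exact formatting varies across Spark versions):

scala> res0.toDebugString
res4: String = (1) ParallelCollectionRDD[0] at parallelize at <console>:22 []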

A similar question about the parallelize() method in SparkContext can be found on Stack Overflow: https://stackoverflow.com/questions/33788600/
