I am trying to understand the effect of providing different numSlices values to the parallelize() method in SparkContext. Given below is the syntax of the method:
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)
(implicit arg0: ClassTag[T]): RDD[T]
I ran spark-shell in local mode:
spark-shell --master local
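Since numSlices defaults to defaultParallelism, it helps to know what that default is here. A minimal sketch, assuming a fresh session started with --master local (a single worker thread), where defaultParallelism resolves to 1:
scala> sc.defaultParallelism
res0: Int = 1
scala> sc.parallelize(1 to 9).partitions.size
res1: Int = 1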
My understanding is that numSlices determines the number of partitions of the resulting RDD (the one returned by sc.parallelize()). Consider the examples below.
Case 1
scala> sc.parallelize(1 to 9, 1);
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> res0.partitions.size
res2: Int = 1
Case 2
scala> sc.parallelize(1 to 9, 2);
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> res3.partitions.size
res4: Int = 2
Case 3
scala> sc.parallelize(1 to 9, 3);
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
scala> res3.partitions.size
res6: Int = 2
Case 4
scala> sc.parallelize(1 to 9, 4);
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22
scala> res3.partitions.size
res8: Int = 2
Question 1: In Case 3 and Case 4, I expected the partition sizes to be 3 and 4 respectively, but in both cases the partition size is only 2. What is the reason for this?
Question 2: In each case there is a number associated with ParallelCollectionRDD[no]. For example, it is ParallelCollectionRDD[0] in Case 1, ParallelCollectionRDD[1] in Case 2, and so on. What exactly do these numbers signify?
Best answer
Question 1: That's a typo on your part. You are calling res3.partitions.size, instead of res5 and res7 respectively. When I do it with the correct numbers, it works as expected.
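For instance, querying the RDDs from Case 3 and Case 4 through the right references (a sketch; the res indices of the results are illustrative and will differ in your session):
scala> res5.partitions.size
res9: Int = 3
scala> res7.partitions.size
res10: Int = 4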
Question 2: That's the ID of the RDD within the Spark context, used for keeping the lineage graph straight. See what happens when I run the same command three times:
scala> sc.parallelize(1 to 9,1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> sc.parallelize(1 to 9,1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22
scala> sc.parallelize(1 to 9,1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
Now there are three different RDDs with three different IDs. We can run the following to check:
scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
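These IDs are also what identifies each RDD when you inspect its lineage. A minimal sketch, assuming the res0 from above is still in scope (the <console> line numbers will vary):
scala> val doubled = res0.map(_ * 2)
doubled: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at map at <console>:23
scala> doubled.toDebugString
res4: String =
(1) MapPartitionsRDD[3] at map at <console>:23 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:22 []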
About apache-spark - the parallelize() method in SparkContext, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33788600/