java - 在Spark中嵌套并行化？什么是正确的方法？

嵌套的民族化？

假设我正在尝试做Spark中的“嵌套循环”。就像普通语言一样，假设我在内部循环中有一个例程，以the Pi Average Spark example does (see Estimating Pi)的方式估算Pi

i = 1000; j = 10^6; counter = 0.0;

for ( int i =0; i < iLimit; i++)
    for ( int j=0; j < jLimit ; j++)
        counter += PiEstimator();

estimateOfAllAverages = counter / i;

我可以在Spark中嵌套并行化调用吗？我正在尝试，但还没有解决。乐于张贴错误和代码，但我想我要问一个更概念性的问题，即这是否是Spark中的正确方法。

我已经可以并行化一个Spark实例/ Pi估计，现在我想执行1000次以查看它是否在Pi上收敛。（这与我们正在尝试解决的更大问题有关，如果需要更接近MVCE的东西，我很乐意补充）

底线问题我只需要一个人直接回答：使用嵌套并行化调用这是正确的方法吗？如果没有，请提出具体建议，谢谢！这是我认为正确的方法的伪代码方法：

// use accumulator to keep track of each Pi Estimate result

sparkContext.parallelize(arrayOf1000, slices).map{ Function call

     sparkContext.parallelize(arrayOf10^6, slices).map{
            // do the 10^6 thing here and update accumulator with each result
    }
}

// take average of accumulator to see if all 1000 Pi estimates converge on Pi

背景：我had asked this question and got a general answer but it did not lead to a solution，经过一番摸索后，我决定发布一个具有不同特征的新问题。我也tried to ask this on the Spark User maillist但那里也没有骰子。在此先感谢您的帮助。

最佳答案

这甚至是不可能的，因为SparkContext无法序列化。如果您想要嵌套的for循环，那么最好的选择是使用cartesian

val nestedForRDD = rdd1.cartesian(rdd2)
nestedForRDD.map((rdd1TypeVal, rdd2TypeVal) => {
  //Do your inner-nested evaluation code here
})

请记住，就像双for循环一样，这要付出一定的代价。