Problem Description
I am very new to Spark and Scala, and I am implementing an iterative algorithm that manipulates a big graph. Assume that inside a for loop we have two RDDs (rdd1 and rdd2) whose values get updated, for example something like:
for (i <- 0 to 5) {
  val rdd1 = rdd2.someTransformations
  rdd2 = rdd1
}
So basically, during iteration i+1 the value of rdd1 is computed based on its value at iteration i. I know that RDDs are immutable, so I cannot really reassign anything to them, but I just wanted to know whether what I have in mind is possible to implement or not. If so, how? Any help is greatly appreciated.
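The pattern described above can be sketched with a plain Scala List standing in for an RDD, so it runs without a cluster (here `_ * 2` is just a placeholder for "some transformations"). The key point is that `rdd2` is declared as a `var`, so the *name* can be rebound to a new collection on each iteration even though each collection itself stays immutable:

```scala
// Plain-collections analogue of the loop in the question.
var rdd2 = List(1, 2, 3)
for (i <- 0 to 5) {
  val rdd1 = rdd2.map(_ * 2) // stand-in for "some transformations"
  rdd2 = rdd1                // rebind the name, not mutate the collection
}
// After 6 iterations, each element has been doubled 6 times (multiplied by 64).
```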
Thanks
Updated: when I try this code:
var size2 = freqSubGraphs.join(groupedNeighbours).map(y => extendFunc(y))
for (i <- 0 to 5) {
  var size2 = size2.map(y => readyForExpandFunc(y))
}
size2.collect()
It gives me this error: "recursive variable size2 needs type". I am not sure what it means.
Recommended Answer
Just open a spark-shell and try it:
scala> var rdd1 = sc.parallelize(List(1,2,3,4,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> for( i <- 0 to 5 ) { rdd1 = rdd1.map( _ + 1 ) }
scala> rdd1.collect()
res1: Array[Int] = Array(7, 8, 9, 10, 11)
As you can see, it works: because rdd1 is declared as a var, the name can be rebound to a new RDD on each iteration, even though each individual RDD remains immutable.
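As for the "recursive variable size2 needs type" error in the updated snippet: writing `var size2 = size2.map(...)` inside the loop declares a *new* local variable that shadows the outer one, and its initializer refers to the very variable being declared, which the compiler rejects. Dropping the inner `var` turns the line into a reassignment of the outer variable. A minimal sketch of the corrected pattern, using a plain Scala List in place of the RDD and a simple `y + 1` in place of the hypothetical `readyForExpandFunc`:

```scala
// Initial transformation, analogous to the join+map in the question.
var size2 = List(1, 2, 3).map(y => y * 2)
for (i <- 0 to 5) {
  size2 = size2.map(y => y + 1) // no `var` here: reassignment, not a new declaration
}
// Each element of List(2, 4, 6) has been incremented 6 times.
```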