Problem Description
I am very new to Spark and Scala, and I am implementing an iterative algorithm that manipulates a big graph. Assume that inside a for loop we have two RDDs (rdd1 and rdd2) whose values get updated, for example something like:
for (i <- 0 to 5) {
  val rdd1 = rdd2.someTransformations
  rdd2 = rdd1
}
So basically, during iteration i+1 the value of rdd1 is computed based on its value at iteration i. I know that RDDs are immutable, so I cannot really reassign anything to them, but I just wanted to know whether what I have in mind is possible to implement or not. If so, how? Any help is greatly appreciated.
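The pattern described above can be sketched with a plain Scala List standing in for an RDD, so it runs without a cluster (here `_ * 2` is just a placeholder for "some transformations"). The key point is that `rdd2` is declared as a `var`, so the *name* can be rebound to a new collection on each iteration even though each collection itself stays immutable:

```scala
// Plain-collections analogue of the loop in the question.
var rdd2 = List(1, 2, 3)
for (i <- 0 to 5) {
  val rdd1 = rdd2.map(_ * 2) // stand-in for "some transformations"
  rdd2 = rdd1                // rebind the name, not mutate the collection
}
// After 6 iterations, each element has been doubled 6 times (multiplied by 64).
```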
Thanks
Updated: when I try this code:
var size2 = freqSubGraphs.join(groupedNeighbours).map(y => extendFunc(y))
for (i <- 0 to 5) {
  var size2 = size2.map(y => readyForExpandFunc(y))
}
size2.collect()
It gives me this error: "recursive variable size2 needs type". I am not sure what it means.
Recommended Answer
Just open a spark-shell and try it:
scala> var rdd1 = sc.parallelize(List(1,2,3,4,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> for( i <- 0 to 5 ) { rdd1 = rdd1.map( _ + 1 ) }
scala> rdd1.collect()
res1: Array[Int] = Array(7, 8, 9, 10, 11)
As you can see, it works: because rdd1 is declared as a var, the name can be rebound to a new RDD on each iteration, even though each individual RDD remains immutable.
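As for the "recursive variable size2 needs type" error in the updated snippet: writing `var size2 = size2.map(...)` inside the loop declares a *new* local variable that shadows the outer one, and its initializer refers to the very variable being declared, which the compiler rejects. Dropping the inner `var` turns the line into a reassignment of the outer variable. A minimal sketch of the corrected pattern, using a plain Scala List in place of the RDD and a simple `y + 1` in place of the hypothetical `readyForExpandFunc`:

```scala
// Initial transformation, analogous to the join+map in the question.
var size2 = List(1, 2, 3).map(y => y * 2)
for (i <- 0 to 5) {
  size2 = size2.map(y => y + 1) // no `var` here: reassignment, not a new declaration
}
// Each element of List(2, 4, 6) has been incremented 6 times.
```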