A more complete solution borrowed from this answer:val rdd = sc.textFile("gene.txt")// create the sliding 3 grams for each partition and record the edgesval rdd1 = rdd.flatMap(_.split("")).mapPartitionsWithIndex((i, iter) => { val slideList = iter.toList.sliding(3).toList Iterator((slideList, (slideList.head, slideList.last)))})// collect the edge values, concatenate edges from adjacent partitions and broadcast itval edgeValues = rdd1.values.collectval sewedEdges = edgeValues zip edgeValues.tail map { case (x, y) => { (x._2 ++ y._1).drop(1).dropRight(1).sliding(3).toList}}val sewedEdgesMap = sc.broadcast( (0 until rdd1.partitions.size) zip sewedEdges toMap)// sew the edge values back to the resultrdd1.keys.mapPartitionsWithIndex((i, iter) => iter ++ List(sewedEdgesMap.value.getOrElse(i, Nil))). flatMap(_.map(_ mkString "")).collect// res54: Array[String] = Array(ATG, TGT, GTA, TAT, ATA, TAC, ACA, CAT, ATA, TAT, ATA, TAT, ATA, TAT, ATA, TAT) 这篇关于多线火花推拉窗的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持! 上岸,阿里云!
08-30 08:16