Problem Description
I have to match an RDD against its element types.
trait Fruit
case class Apple(price:Int) extends Fruit
case class Mango(price:Int) extends Fruit
Now a stream of type DStream[Fruit] is coming in. Each element is either an Apple or a Mango.
How can I perform an operation based on the subclass? Something like the following (which doesn't work):
dStream.foreachRDD { rdd: RDD[Fruit] =>
  rdd match {
    case rdd: RDD[Apple] =>
      // do something
    case rdd: RDD[Mango] =>
      // do something
    case _ =>
      println(rdd.count() + "<<<< not matched anything")
  }
}
Recommended Answer
Since we have an RDD[Fruit], any row can be either an Apple or a Mango. When using foreachRDD, each RDD will contain a mix of these (and possibly other) types.
To differentiate between the types, we can use collect[U](f: PartialFunction[T, U]): RDD[U] (not to be confused with collect(): Array[T], which returns an array of all the elements in the RDD). By applying the partial function f, this method returns an RDD containing only the matching values; in this case, f can be a pattern match.
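To make the distinction concrete, here is a minimal sketch (assuming a plain RDD[Fruit] named fruits, outside the streaming code) contrasting the two collect methods:

val fruits: RDD[Fruit] = spark.sparkContext.parallelize(Seq(Apple(5), Mango(11)))

// collect(): Array[T] -- brings every element back to the driver as an array
val allRows: Array[Fruit] = fruits.collect()

// collect(pf): RDD[U] -- keeps only the rows matched by the partial function
// (transforming them) and stays distributed as an RDD
val applePrices: RDD[Int] = fruits.collect { case Apple(price) => price }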
Below follows a small illustrative example (adding an Orange to the fruits as well).
Setup:
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val inputData: Queue[RDD[Fruit]] = Queue()
val dStream: InputDStream[Fruit] = ssc.queueStream(inputData)
inputData += spark.sparkContext.parallelize(Seq(Apple(5), Apple(5), Mango(11)))
inputData += spark.sparkContext.parallelize(Seq(Mango(10), Orange(1), Orange(3)))
This creates a stream of RDD[Fruit] with two separate RDDs.
dStream.foreachRDD { rdd: RDD[Fruit] =>
  val mix = rdd.collect {
    case row: Apple => ("APPLE", row.price) // do any computation on apple rows
    case row: Mango => ("MANGO", row.price) // do any computation on mango rows
    //case _@row => do something with other rows (will be removed by default).
  }
  mix foreach println
}
In the collect above, we change each row slightly (dropping the class) and then print the resulting RDD. Result:
// First RDD
(MANGO,11)
(APPLE,5)
(APPLE,5)
// Second RDD
(MANGO,10)
As can be seen, the pattern match has kept and transformed the rows containing Apple and Mango while removing all Orange rows.
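If the other rows should not be dropped, one option (a sketch reusing the same classes) is to give them an explicit case instead of letting the partial function filter them out:

val all = rdd.collect {
  case row: Apple  => ("APPLE", row.price)
  case row: Mango  => ("MANGO", row.price)
  case row: Orange => ("ORANGE", row.price) // Orange rows are now kept as well
}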
Separate RDDs
If wanted, it is also possible to separate the two subclasses into their own RDDs as follows. Any computations can then be performed on these two RDDs.
val apple = rdd.collect{case row: Apple => row}
val mango = rdd.collect{case row: Mango => row}
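For example, per-type computations could then look like this (a sketch; the sum and count below are just placeholders for whatever logic is needed):

// e.g. total apple price and number of mangoes in this RDD
val appleTotal = apple.map(_.price).sum() // sum() returns a Double
val mangoCount = mango.count()
println(s"apple total: $appleTotal, mango count: $mangoCount")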
Complete example code
import scala.collection.mutable.Queue

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream

trait Fruit
case class Apple(price: Int) extends Fruit
case class Mango(price: Int) extends Fruit
case class Orange(price: Int) extends Fruit

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

    val inputData: Queue[RDD[Fruit]] = Queue()
    val inputStream: InputDStream[Fruit] = ssc.queueStream(inputData)

    inputData += spark.sparkContext.parallelize(Seq(Apple(5), Apple(5), Mango(11)))
    inputData += spark.sparkContext.parallelize(Seq(Mango(10), Orange(1), Orange(3)))

    inputStream.foreachRDD { rdd: RDD[Fruit] =>
      val mix = rdd.collect {
        case row: Apple => ("APPLE", row.price) // do any computation on apple rows
        case row: Mango => ("MANGO", row.price) // do any computation on mango rows
        //case _@row => do something with other rows (will be removed by default).
      }
      mix foreach println
    }

    ssc.start()
    ssc.awaitTermination()
  }
}