Problem description
I can print the data in two RDDs with the code below.
usersRDD.foreach(println)
empRDD.foreach(println)
I need to compare the data in the two RDDs. How can I iterate over the records in one RDD and compare its fields with the fields in the other? For example: iterate over the records and check whether the name and age in userRDD have a matching record in empRDD, and if not, put the record in a separate RDD.
I tried userRDD.subtract(empRDD), but it compares all the fields.
Recommended answer
You need to key the data in each RDD so that there is something to join the records on. Have a look at groupBy, for example. Then join the resulting RDDs. For each key, you get the matching values from both. If you are interested in finding the unmatched keys, use leftOuterJoin, like this:
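import org.apache.spark.rdd.RDD

// Returns the entries in userRDD that have no corresponding key in empRDD.
def nonEmp(userRDD: RDD[(String, String)], empRDD: RDD[(String, String)]): RDD[(String, String)] = {
  // leftOuterJoin yields (key, (userValue, Option[empValue])); None means the key is absent from empRDD.
  userRDD.leftOuterJoin(empRDD).collect {
    case (name, (age, None)) => name -> age
  }
}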
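To compare on name and age together rather than on a single key, one option is to key both RDDs by the (name, age) pair before comparing. The sketch below is only an illustration under assumptions: the User and Emp case classes, their fields, and the sample data are hypothetical, since the question does not show the actual record layout. It uses keyBy to build the composite key and subtractByKey, which drops every user whose key also appears in empRDD; a leftOuterJoin filtered on None, as above, would give the same users.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical record shapes; the real fields are not shown in the question.
case class User(name: String, age: Int, city: String)
case class Emp(name: String, age: Int, dept: String)

object CompareRdds {
  // Returns the users whose (name, age) pair has no match in empRDD.
  def unmatchedUsers(userRDD: RDD[User], empRDD: RDD[Emp]): RDD[User] = {
    val usersByKey = userRDD.keyBy(u => (u.name, u.age)) // RDD[((String, Int), User)]
    val empsByKey  = empRDD.keyBy(e => (e.name, e.age))  // RDD[((String, Int), Emp)]
    // Only the keys are compared, so the remaining fields may differ freely.
    usersByKey.subtractByKey(empsByKey).values
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compare-rdds")
      .master("local[*]") // local mode, for trying the sketch out
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up sample data for illustration.
    val users = sc.parallelize(Seq(User("Ann", 30, "Oslo"), User("Bob", 41, "Rome")))
    val emps  = sc.parallelize(Seq(Emp("Ann", 30, "Sales")))

    // Ann matches on (name, age), so only the Bob record should remain.
    unmatchedUsers(users, emps).collect().foreach(println)

    spark.stop()
  }
}
```

Unlike subtract, which compares whole records, this approach compares only the key, so it fits the case where the two RDDs share some fields but not others.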