Problem description
I have two RDDs.
rdd1 = (String, String)
key1, value11
key2, value12
key3, value13
rdd2 = (String, String)
key2, value22
key3, value23
key4, value24
I need to form another RDD with the merged rows from rdd1 and rdd2; the output should look like:
key2, value12 ; value22
key3, value13 ; value23
So, basically it's nothing but taking the intersection of the keys of rdd1 and rdd2 and then joining their values. The values should be in order, i.e. value(rdd1) + value(rdd2), and not the reverse.
Recommended answer
I think this may be what you are looking for:
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
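The quote above describes what join returns, but not how to format the merged values. Below is a minimal sketch of one way to get the exact output from the question: join, then mapValues to concatenate the two values with " ; ", keeping rdd1's value first. The app name, master setting, and sample data are placeholder assumptions used only to make the example self-contained.

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder local setup; in a real job the SparkContext would already exist.
val conf = new SparkConf().setAppName("MergeRdds").setMaster("local[*]")
val sc = new SparkContext(conf)

// Sample data matching the question.
val rdd1 = sc.parallelize(Seq(("key1", "value11"), ("key2", "value12"), ("key3", "value13")))
val rdd2 = sc.parallelize(Seq(("key2", "value22"), ("key3", "value23"), ("key4", "value24")))

// join keeps only the keys present in both RDDs (the intersection),
// and each resulting tuple (v1, v2) holds rdd1's value before rdd2's.
val merged = rdd1.join(rdd2).mapValues { case (v1, v2) => s"$v1 ; $v2" }

merged.collect().foreach(println)
// Prints (row order is not guaranteed):
// (key2,value12 ; value22)
// (key3,value13 ; value23)

For larger datasets you could also pass a partition count to join, which is what the optional [numTasks] parameter in the quoted signature refers to.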