Problem description
I have two RDDs.
rdd1 = (String, String)
key1, value11
key2, value12
key3, value13
rdd2 = (String, String)
key2, value22
key3, value23
key4, value24
I need to form another RDD with the merged rows from rdd1 and rdd2; the output should look like:
key2, value12 ; value22
key3, value13 ; value23
So, basically it's nothing but taking the intersection of the keys of rdd1 and rdd2 and then joining their values. The values should be in order, i.e. value(rdd1) + value(rdd2), and not the reverse.
Recommended answer
I think this may be what you are looking for:
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
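The quote above describes what join returns, but not how to format the merged values. Below is a minimal sketch of one way to get the exact output from the question: join, then mapValues to concatenate the two values with " ; ", keeping rdd1's value first. The app name, master setting, and sample data are placeholder assumptions used only to make the example self-contained.

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder local setup; in a real job the SparkContext would already exist.
val conf = new SparkConf().setAppName("MergeRdds").setMaster("local[*]")
val sc = new SparkContext(conf)

// Sample data matching the question.
val rdd1 = sc.parallelize(Seq(("key1", "value11"), ("key2", "value12"), ("key3", "value13")))
val rdd2 = sc.parallelize(Seq(("key2", "value22"), ("key3", "value23"), ("key4", "value24")))

// join keeps only the keys present in both RDDs (the intersection),
// and each resulting tuple (v1, v2) holds rdd1's value before rdd2's.
val merged = rdd1.join(rdd2).mapValues { case (v1, v2) => s"$v1 ; $v2" }

merged.collect().foreach(println)
// Prints (row order is not guaranteed):
// (key2,value12 ; value22)
// (key3,value13 ; value23)

For larger datasets you could also pass a partition count to join, which is what the optional [numTasks] parameter in the quoted signature refers to.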