问题描述
我有两个rdd,一个rdd,只有一列,另外有两列,将键上的两个RDD连接起来,我添加了虚拟值0,请问还有其他有效的方法可以使用join吗?
I have two rdd one rdd have just one column other have two columns to join the two RDD on key's I have add dummy value which is 0 , is there any other efficient way of doing this using join ?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("Cml-100k/u.item")
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1)))
val shit = movienames.join(moviesid).distinct()
修改:
让我用SQL转换此问题.举例来说,我有 table1(移动影片)
和 table2(电影编号,电影名称)
.在SQL中,我们编写如下内容:
Let me convert this question in SQL. Say for example I have table1 (moveid)
and table2 (movieid,moviename)
. In SQL we write something like:
select moviename, movieid, count(1)
from table2 inner join table table1 on table1.movieid=table2.moveid
group by ....
在SQL中, table1
只有一列,而 table2
仍然有两列, join
仍然有效,Spark中的相同方法可以在来自两个RDD的密钥.
here in SQL table1
has only one column where as table2
has two columns still the join
works, same way in Spark can join on keys from both the RDD's.
推荐答案
联接操作仅在 PairwiseRDD
上定义,这与SQL中的关系/表完全不同. PairwiseRDD
的每个元素都是一个 Tuple2
,其中第一个元素是 key
,第二个元素是 value
.只要 key
提供有意义的 hashCode
Join operation is defined only on PairwiseRDDs
which are quite different from a relation / table in SQL. Each element of PairwiseRDD
is a Tuple2
where the first element is the key
and the second is value
. Both can contain complex objects as long as key
provides a meaningful hashCode
如果您想以SQL形式考虑这一点,则可以考虑将键视为必需,因为转到 ON
子句的所有内容和 value
都包含选定的列.
If you want to think about this in a SQL-ish you can consider key as everything that goes to ON
clause and value
contains selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
虽然这些方法乍一看看起来很相似,但是您可以使用另一种方法来表达它们,但是有一个根本的区别.当您查看SQL表并忽略约束时,所有列都属于同一类对象,而 PairwiseRDD
中的 key
和 value
清楚的意思.
While these approaches look similar at first glance and you can express one using another there is one fundamental difference. When you look at the SQL table and you ignore constraints all columns belong in the same class of objects, while key
and value
in the PairwiseRDD
have a clear meaning.
回到您的问题以使用 join
时,您同时需要 key
和 value
.可以说,使用 null
单例比使用 0
作为占位符要干净得多,但实际上没有办法解决.
Going back to your problem to use join
you need both key
and value
. Arguably much cleaner than using 0
as a placeholder would be to use null
singleton but there is really no way around it.
对于小数据,您可以通过类似的方式使用过滤器广播连接:
For small data you can use filter in a similar way to broadcast join:
val moviesidBD = sc.broadcast(
lines.map(x => x.split("\t")).map(_.head).collect.toSet)
movienames.filter{case (id, _) => moviesidBD.value contains id}
但是,如果您确实希望使用SQL类的连接,则只需使用SparkSQL.
but if you really want SQL-ish joins then you should simply use SparkSQL.
val movieIdsDf = lines
.map(x => x.split("\t"))
.map(a => Tuple1(a.head))
.toDF("id")
val movienamesDf = movienames.toDF("id", "name")
// Add optional join type qualifier
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
这篇关于加入两个RDD的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!