Question
I am new to Scala and Spark. I have two RDDs, for example A is [(1,2),(2,3)] and B is [(4,5),(5,6)], and I want to get an RDD like [(1,2),(2,3),(4,5),(5,6)]. The problem is that my data is large; suppose both A and B are 10 GB. I use sc.union(A,B), but it is slow. In the Spark UI I saw 28308 tasks in this stage.
Is there a more efficient way to do this?
Recommended answer
Why don't you convert the two RDDs to DataFrames and use the union function? Converting to a DataFrame is easy: you just need to import sqlContext.implicits._ and apply the .toDF() function with header names. For example:
import org.apache.spark.sql.SparkSession

// Local session for testing; drop .master("local") when submitting to a cluster
val sparkSession = SparkSession.builder().appName("testings").master("local").getOrCreate()
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._

val firstTableColumns = Seq("col1", "col2")
val secondTableColumns = Seq("col3", "col4")
var firstDF = Seq((1, 2), (2, 3), (3, 4), (2, 3), (3, 4)).toDF(firstTableColumns: _*)
val secondDF = Seq((4, 5), (5, 6), (6, 7), (4, 5)).toDF(secondTableColumns: _*)
// union matches columns by position, so the result keeps firstDF's column names
firstDF = firstDF.union(secondDF)
Working with DataFrames should be much easier for you than working with RDDs. Converting a DataFrame back to an RDD is easy too: just call .rdd on it (the result is an RDD of Row objects).
val rddData = firstDF.rdd
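The snippets above build the DataFrames from local Seqs. For RDDs like the A and B in the question, the same round trip might look like the sketch below. The object name UnionSketch, the helper unionViaDataFrames, the column names, and the assumption that both RDDs hold (Int, Int) pairs are illustrative and not from the original post; whether this actually runs faster than sc.union on 10 GB inputs is something to measure rather than assume.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object UnionSketch {
  // Hypothetical helper: unions two pair RDDs by going through DataFrames.
  def unionViaDataFrames(spark: SparkSession,
                         a: RDD[(Int, Int)],
                         b: RDD[(Int, Int)]): RDD[(Int, Int)] = {
    import spark.implicits._ // brings .toDF on RDDs into scope

    // Same column names on both sides; union matches columns by position.
    val unionDF = a.toDF("col1", "col2").union(b.toDF("col1", "col2"))

    // .rdd yields RDD[Row], so map each Row back to a tuple.
    unionDF.rdd.map(row => (row.getInt(0), row.getInt(1)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("union-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Small stand-ins for the question's A and B.
    val a = sc.parallelize(Seq((1, 2), (2, 3)))
    val b = sc.parallelize(Seq((4, 5), (5, 6)))

    unionViaDataFrames(spark, a, b).collect().foreach(println)
    spark.stop()
  }
}
```

Like RDD union, DataFrame union keeps duplicate rows, so call .distinct() on the result only if you need set semantics.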