


I am new to Scala and Spark. I have two RDDs: A is [(1,2),(2,3)] and B is [(4,5),(5,6)], and I want to get an RDD like [(1,2),(2,3),(4,5),(5,6)]. The problem is that my data is large; suppose A and B are each 10 GB. I use sc.union(A,B), but it is slow. In the Spark UI I can see that this stage has 28308 tasks.




Why don't you convert the two RDDs to DataFrames and use the union function?
Converting to a DataFrame is easy: you just need to import sqlContext.implicits._ and call .toDF() with the column names.
For example:

    import org.apache.spark.sql.SparkSession

    // Local session for the example; point master at your cluster in practice
    val sparkSession = SparkSession.builder().appName("testings").master("local").getOrCreate()
    val sqlContext = sparkSession.sqlContext

    val firstTableColumns = Seq("col1", "col2")
    val secondTableColumns = Seq("col3", "col4")

    // Brings .toDF() into scope
    import sqlContext.implicits._

    var firstDF = Seq((1, 2), (2, 3), (3, 4), (2, 3), (3, 4)).toDF(firstTableColumns: _*)
    val secondDF = Seq((4, 5), (5, 6), (6, 7), (4, 5)).toDF(secondTableColumns: _*)

    // union matches columns by position, so the result keeps the names col1 and col2
    firstDF = firstDF.union(secondDF)


You should find DataFrames easier to work with than RDDs. Converting a DataFrame back to an RDD is easy too: just call .rdd, which returns an RDD[Row]:

    val rddData = firstDF.rdd
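The example above builds its DataFrames from local Seq values. Since the question starts from RDDs, here is a minimal sketch of the same round trip starting from two RDDs. The names rddA, rddB, unionDF and unionRDD are made up for illustration, and the tiny parallelized collections only stand in for the real 10 GB inputs. It has not been checked whether this is actually faster than sc.union on data of that size.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-union-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the two large RDDs from the question
val rddA = spark.sparkContext.parallelize(Seq((1, 2), (2, 3)))
val rddB = spark.sparkContext.parallelize(Seq((4, 5), (5, 6)))

// RDD[(Int, Int)] -> DataFrame; union matches columns by position
val unionDF = rddA.toDF("col1", "col2").union(rddB.toDF("col1", "col2"))

// .rdd yields RDD[Row], so map each Row back to a tuple
val unionRDD = unionDF.rdd.map(row => (row.getInt(0), row.getInt(1)))

unionRDD.collect().foreach(println)
```

Like sc.union, DataFrame union keeps duplicate rows; call .distinct() afterwards if you need set semantics.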


08-30 06:23