This article looks at how to union multiple DataFrames in Spark (unionAll); it may be a useful reference if you run into the same problem.

Problem Description

Given a set of DataFrames:

val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")
val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y")
val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z")

To union all of them, I do:

df1.unionAll(df2).unionAll(df3)

Is there a more elegant and scalable way of doing this for any number of DataFrames, for example from

Seq(df1, df2, df3)

Recommended Answer

The simplest solution is to reduce with union (unionAll in Spark < 2.0):

val dfs = Seq(df1, df2, df3)
dfs.reduce(_ union _)
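
For Spark versions before 2.0, where DataFrame exposes unionAll rather than union, the equivalent one-liner would be (a minimal sketch, assuming the same dfs as above):

dfs.reduce(_ unionAll _)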

This is relatively concise and shouldn't move data from off-heap storage, but each union extends the query plan, so plan analysis can take non-linear time. This can become a problem if you try to merge a large number of DataFrames.
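As a rough illustration of the plan growth (the loop and the count of 200 are arbitrary assumptions, not part of the original answer), reducing a long Seq of DataFrames produces one nested Union node per input, and the resulting plan can be inspected with explain():

// build many small DataFrames with the same schema (illustrative only)
val many = (1 to 200).map(i => sc.parallelize(1 to 4).map(j => (j, j * i)).toDF("id", "x"))

// one Union node per DataFrame accumulates in the logical plan
val combined = many.reduce(_ union _)

// printing the plan shows how large it has grown
combined.explain(true)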

You can also convert to RDDs and use SparkContext.union:

dfs match {
  // single DataFrame: nothing to union
  case h :: Nil => Some(h)
  // take the schema from the head and union the underlying RDDs in one step
  case h :: _   => Some(h.sqlContext.createDataFrame(
                     h.sqlContext.sparkContext.union(dfs.map(_.rdd)),
                     h.schema
                   ))
  // empty input: no DataFrame to return
  case Nil      => None
}
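
If this is needed in more than one place, the match can be wrapped in a small helper. The sketch below is one way to do it; the name unionViaRdd is hypothetical and not part of the original answer:

import org.apache.spark.sql.DataFrame

// a minimal sketch: union any number of DataFrames through SparkContext.union
def unionViaRdd(dfs: Seq[DataFrame]): Option[DataFrame] = dfs.toList match {
  case Nil      => None
  case h :: Nil => Some(h)
  case h :: _   => Some(h.sqlContext.createDataFrame(
                     h.sqlContext.sparkContext.union(dfs.map(_.rdd)),
                     h.schema))
}

// usage, reusing df1, df2, df3 from the question
unionViaRdd(Seq(df1, df2, df3)).foreach(_.show())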

It keeps the plan analysis cost low, but otherwise it is less efficient than merging DataFrames directly.

That concludes this article on unioning multiple DataFrames in Spark; hopefully the answer above is helpful.
