Problem description
I have a DataFrame field that is a Seq[Seq[String]].
I built a UDF to transform that column into a column of Seq[String]; basically, a UDF wrapping Scala's flatten function.
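import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Returns a DataFrame transform that flattens the nested array in inCol into outCol.
def combineSentences(inCol: String, outCol: String): DataFrame => DataFrame = {
  // A null input array becomes an empty Seq instead of propagating null.
  def flatfunc(seqOfSeq: Seq[Seq[String]]): Seq[String] = seqOfSeq match {
    case null => Seq.empty[String]
    case _    => seqOfSeq.flatten
  }
  val flattenUdf = udf(flatfunc _)
  (df: DataFrame) => df.withColumn(outCol, flattenUdf(col(inCol)))
}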
My use case is strings, but this could obviously be made generic. You can use this function in a chain of DataFrame transforms, for example:
df.transform(combineSentences(inCol, outCol))
Is there a Spark built-in function that does the same thing? I have not been able to find one.
Recommended answer
There is a similar built-in function (since Spark 2.4), called flatten:
import org.apache.spark.sql.functions.flatten
From the official documentation:
def flatten(e: Column): Column
Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.
Since: 2.4.0
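As a rough illustration of the built-in in use, the sketch below applies flatten to a small DataFrame. The local SparkSession settings, the object name FlattenExample, and the column names "sentences" and "words" are assumptions made for the example and do not come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, flatten}

object FlattenExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-example")
      .getOrCreate()
    import spark.implicits._

    // "sentences" is a hypothetical column of type array<array<string>>.
    val df = Seq(
      Seq(Seq("a", "b"), Seq("c")),
      Seq(Seq("d"))
    ).toDF("sentences")

    // flatten removes one level of nesting: array<array<string>> becomes array<string>.
    df.withColumn("words", flatten(col("sentences"))).show(false)

    spark.stop()
  }
}
```

Because this is a native Column expression rather than a UDF, Spark does not need to deserialize the rows into Scala collections to evaluate it.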
To get the exact equivalent of the UDF above, you'll have to use coalesce to replace NULL, because flatten returns NULL for a NULL input whereas the UDF returns an empty sequence.
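One possible way to combine the two is sketched below. The helper name combineSentencesBuiltin is hypothetical, and using typedLit(Seq.empty[String]) as the fallback value is an assumption about how to express an empty array<string> literal; it is not something the answer itself specifies.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, flatten, typedLit}

// Hypothetical helper mirroring combineSentences from the question,
// built on the native flatten function instead of a UDF.
def combineSentencesBuiltin(inCol: String, outCol: String): DataFrame => DataFrame = {
  // Empty array<string> literal used whenever flatten yields NULL.
  val emptyArray = typedLit(Seq.empty[String])
  (df: DataFrame) => df.withColumn(outCol, coalesce(flatten(col(inCol)), emptyArray))
}
```

It can be chained the same way as the original, for example df.transform(combineSentencesBuiltin(inCol, outCol)). Note that wrapping flatten in coalesce replaces every NULL that flatten produces, not only those caused by a NULL input column, so the behaviour may differ from the UDF in edge cases.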