Is there a Spark built-in that flattens nested arrays?


Question

I have a DataFrame field that is a Seq[Seq[String]]. I built a UDF to transform that column into a column of Seq[String]; basically, a UDF wrapping Scala's flatten function.

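import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def combineSentences(inCol: String, outCol: String): DataFrame => DataFrame = {
    // Flatten one level of nesting; a null column value becomes an empty Seq.
    def flatfunc(seqOfSeq: Seq[Seq[String]]): Seq[String] = seqOfSeq match {
        case null => Seq.empty[String]
        case _ => seqOfSeq.flatten
    }
    // Return a transform that writes the flattened array to outCol.
    (df: DataFrame) => df.withColumn(outCol, udf(flatfunc _).apply(col(inCol)))
}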

My use case is strings, but obviously this could be made generic. You can use this function in a chain of DataFrame transforms like:

df.transform(combineSentences(inCol, outCol))

Is there a Spark built-in function that does the same thing? I have not been able to find one.

Recommended answer

There is a similar built-in function (since Spark 2.4), and it is called flatten:

import org.apache.spark.sql.functions.flatten

From the official documentation:

def flatten(e: Column): Column

Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.

Since: 2.4.0
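To illustrate the "only one level of nesting is removed" behavior, here is a minimal sketch intended for pasting into spark-shell or a Scala script. The local session, the column names (deep, flat1, flat2), and the sample data are assumptions made up for this example, not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, flatten}

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("flatten-depth-sketch").getOrCreate()
import spark.implicits._

// One row holding a three-level nested array: array<array<array<int>>>.
val nested = Seq(Tuple1(Seq(Seq(Seq(1, 2), Seq(3)), Seq(Seq(4))))).toDF("deep")

// The first flatten removes a single level, leaving array<array<int>>.
val once = nested.withColumn("flat1", flatten(col("deep")))

// A second flatten is needed to reach a plain array<int>.
val twice = once.withColumn("flat2", flatten(col("flat1")))

twice.printSchema()
```

Applying flatten repeatedly, once per extra level, is one way to collapse deeper structures.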

To get the exact equivalent of the UDF, you'll have to wrap the result in coalesce to replace NULL with an empty array.
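As a sketch of how the UDF from the question might be replaced, the following combines flatten with coalesce. The name combineSentencesBuiltin, the use of typedLit(Seq.empty[String]) as the empty-array fallback, the local session, and the sample data are all assumptions chosen for illustration; the answer itself only says to use coalesce.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, flatten, typedLit}

// Hypothetical built-in counterpart of combineSentences:
// flatten one level, then fall back to an empty array when the result is NULL.
def combineSentencesBuiltin(inCol: String, outCol: String): DataFrame => DataFrame =
  (df: DataFrame) =>
    df.withColumn(outCol, coalesce(flatten(col(inCol)), typedLit(Seq.empty[String])))

// Hypothetical local session and sample data, for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("flatten-sketch").getOrCreate()
import spark.implicits._

val df = Seq[(Int, Option[Seq[Seq[String]]])](
  (1, Some(Seq(Seq("a", "b"), Seq("c")))),
  (2, None) // becomes a NULL array column value
).toDF("id", "sentences")

val result = df.transform(combineSentencesBuiltin("sentences", "words"))
result.show(truncate = false)
```

It plugs into a transform chain the same way as the UDF version, while keeping the work inside Spark's built-in expressions.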

