Problem description
I am trying to 'lift' the fields of a struct to the top level of a DataFrame, as illustrated by this example:
case class A(a1: String, a2: String)
case class B(b1: String, b2: A)
val df = Seq(B("X",A("Y","Z"))).toDF
df.show
+---+-----+
| b1|   b2|
+---+-----+
|  X|[Y,Z]|
+---+-----+
df.printSchema
root
 |-- b1: string (nullable = true)
 |-- b2: struct (nullable = true)
 |    |-- a1: string (nullable = true)
 |    |-- a2: string (nullable = true)
val lifted = df.withColumn("a1", $"b2.a1").withColumn("a2", $"b2.a2").drop("b2")
lifted.show
+---+---+---+
| b1| a1| a2|
+---+---+---+
|  X|  Y|  Z|
+---+---+---+
lifted.printSchema
root
 |-- b1: string (nullable = true)
 |-- a1: string (nullable = true)
 |-- a2: string (nullable = true)
This works. I would like to create a small utility method that does this for me, probably by enriching DataFrame with an implicit class (the "pimp my library" pattern) to enable something like df.lift("b2").
To do this, I think I want a way of obtaining a list of all fields within a Struct. E.g. given "b2" as input, return ["a1","a2"]. How do I do this?
Recommended answer
If I understand your question correctly, you want to be able to list the nested fields of column b2.
So you would need to filter on b2, access the StructType of b2, and then map the names of the columns from within the fields (StructField):
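import org.apache.spark.sql.types.StructType

// Keep only the top-level field named "b2", then collect the names of its nested fields.
val nested_fields = df.schema
  .filter(c => c.name == "b2")
  .flatMap(_.dataType.asInstanceOf[StructType].fields) // assumes b2 is a struct column
  .map(_.name)
// nested_fields: Seq[String] = List(a1, a2)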
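
Building on the field lookup above, here is one possible sketch of the df.lift("b2") helper the question asks for. The names LiftSyntax, LiftOps and lift are hypothetical (they are not part of the Spark API), and the sketch assumes that column and field names contain no dots and that the nested field names do not clash with existing top-level columns.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical enrichment of DataFrame; `lift` is an illustrative name, not a Spark method.
object LiftSyntax {
  implicit class LiftOps(val df: DataFrame) extends AnyVal {
    // Replaces the struct column `structCol` with one top-level column per nested field.
    def lift(structCol: String): DataFrame =
      df.schema(structCol).dataType match {
        case st: StructType =>
          // All columns except the struct itself, in their original order.
          val others = df.columns.filter(_ != structCol).map(col)
          // One column per nested field, renamed to the bare field name.
          val lifted = st.fieldNames.map(f => col(s"$structCol.$f").as(f))
          df.select(others ++ lifted: _*)
        case other =>
          throw new IllegalArgumentException(
            s"Column '$structCol' is not a struct (found ${other.simpleString})")
      }
  }
}
```

After import LiftSyntax._, calling df.lift("b2") on the example DataFrame should yield the same b1, a1, a2 columns as lifted in the question. If a helper method is not needed, Spark's star expansion on a struct column, df.select($"b1", $"b2.*"), flattens the fields in a single call.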