Accessing the names of fields in a struct in Spark SQL

Question

I am trying to 'lift' the fields of a struct to the top level in a dataframe, as illustrated by this example:

case class A(a1: String, a2: String)
case class B(b1: String, b2: A)

val df = Seq(B("X",A("Y","Z"))).toDF

df.show
+---+-----+
| b1|   b2|
+---+-----+
|  X|[Y,Z]|
+---+-----+

df.printSchema
root
 |-- b1: string (nullable = true)
 |-- b2: struct (nullable = true)
 |    |-- a1: string (nullable = true)
 |    |-- a2: string (nullable = true)

val lifted = df.withColumn("a1", $"b2.a1").withColumn("a2", $"b2.a2").drop("b2")

lifted.show
+---+---+---+
| b1| a1| a2|
+---+---+---+
|  X|  Y|  Z|
+---+---+---+

lifted.printSchema
root
 |-- b1: string (nullable = true)
 |-- a1: string (nullable = true)
 |-- a2: string (nullable = true)

This works. I would like to create a small utility method that does this for me, probably by enriching DataFrame through an implicit class (the "pimp my library" pattern), so that I can write something like df.lift("b2").

To do this, I think I want a way of obtaining a list of all fields within a Struct. E.g. given "b2" as input, return ["a1","a2"]. How do I do this?

Recommended answer

If I understand your question correctly, you want to be able to list the nested fields of column b2.

To do that, filter the schema down to b2, cast its data type to StructType, and then map each of its fields (StructField) to its name:

import org.apache.spark.sql.types.StructType

val nested_fields = df.schema
                   .filter(c => c.name == "b2")
                   .flatMap(_.dataType.asInstanceOf[StructType].fields)
                   .map(_.name)

// nested_fields: Seq[String] = List(a1, a2)
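
Building on the field lookup above, the df.lift("b2") helper from the question could be sketched roughly as follows. The names LiftSyntax, LiftOps and lift are hypothetical and not part of Spark; the sketch assumes the nested field names contain no dots and do not clash with existing top-level columns (withColumn would replace a column that has the same name).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helper object (not part of Spark): importing LiftSyntax._
// makes df.lift("structCol") available on any DataFrame.
object LiftSyntax {
  implicit class LiftOps(df: DataFrame) {
    def lift(structCol: String): DataFrame = {
      // Look the column up in the schema and collect its nested field names.
      // df.schema(name) throws IllegalArgumentException if the column is missing.
      val nested: Seq[String] = df.schema(structCol).dataType match {
        case st: StructType => st.fieldNames.toSeq
        case other =>
          throw new IllegalArgumentException(
            s"Column '$structCol' is ${other.simpleString}, not a struct")
      }
      // Add every nested field as a top-level column, then drop the struct.
      nested
        .foldLeft(df)((acc, f) => acc.withColumn(f, col(s"$structCol.$f")))
        .drop(structCol)
    }
  }
}

// Usage, e.g. in spark-shell with the df defined in the question:
//   import LiftSyntax._
//   val lifted = df.lift("b2")
```

If the only goal is to flatten the struct, selecting with a star, as in df.select("b1", "b2.*"), also expands the nested fields without inspecting the schema by hand.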


09-18 08:52