This article covers how to select a few columns from a nested array of structs in a DataFrame in Scala.
Problem description
I have a dataframe with an array of structs, and inside that another array of structs. Is there an easy way to select a few of the struct fields in the main array, and a few in the nested array, without disturbing the structure of the rest of the dataframe?
Simplified input:
-MainArray
---StructCol1
---StructCol2
---StructCol3
---SubArray
------SubArrayStruct4
------SubArrayStruct5
------SubArrayStruct6
Simplified output:
-MainArray
---StructCol1
---StructCol2
---SubArray
------SubArrayStruct4
------SubArrayStruct5
The source code I tried is as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
val arrayStructData = Seq(
Row("Army",List(Row("1","Infantry","100",List(Row("Gun","Station"),Row("Bazooka","Barracks"))),Row("2","Cavalry","150",List(Row("Grenadier","Seige factory"),Row("Canon","Tank Factory"))))),
Row("Navy",List(Row("3","Transport","200",List(Row("Cruiser","Cruise Lines"),Row("SubMarine","Yard"))),Row("4","Battle Ships","250",List(Row("Frigate","Dock"),Row("Galleon","Hub")))))
)
val arrayStructSchema = new StructType()
  .add("Category",StringType)
  .add("ArmyOrNavy",ArrayType(new StructType()
    .add("ID",StringType)
    .add("Type",StringType)
    .add("Count",StringType)
    .add("Items",ArrayType(new StructType().add("ItemName",StringType).add("ItemTrainingArea",StringType)))
  ))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
df.printSchema()
df.show(false)
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- Count: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemName: string (nullable = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Category|ArmyOrNavy |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Army |[[1, Infantry, 100, [[Gun, Station], [Bazooka, Barracks]]], [2, Cavalry, 150, [[Grenadier, Seige factory], [Canon, Tank Factory]]]]|
|Navy |[[3, Transport, 200, [[Cruiser, Cruise Lines], [SubMarine, Yard]]], [4, Battle Ships, 250, [[Frigate, Dock], [Galleon, Hub]]]] |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
The output I need is:
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
I tried doing something like this, but it doesn't look right:
val df2 = df.selectExpr("Category",
"Array (Struct(ArmyOrNavy.ID,CAST(ArmyOrNavy.Items AS array<array<struct<ItemName:string,ItemTrainingArea:string>>>) Items)) as ArmyOrNavy")
df2.printSchema
df2.show(false)
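One likely reason the attempt above does not produce the desired shape: when a field is accessed through an array of structs (as in ArmyOrNavy.ID), Spark returns an array of that field's values, so wrapping the result in Struct(...) builds a single struct of arrays rather than an array of structs. The sketch below illustrates only that behaviour; the local SparkSession, the trimmed two-field schema, and the names `sample` and `ids` are assumptions made for the illustration and are not part of the original question.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("field-access-sketch").getOrCreate()

// Hypothetical, trimmed-down version of the question's schema.
val schema = new StructType()
  .add("ArmyOrNavy", ArrayType(new StructType().add("ID", StringType).add("Type", StringType)))

val sample = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(List(Row("1", "Infantry"), Row("2", "Cavalry"))))),
  schema)

// Accessing a field through an array of structs yields an array of the field's values.
val ids = sample.select(col("ArmyOrNavy.ID").as("ids"))
ids.printSchema()
```

Here the `ids` column is an array of strings, not an array of structs, which is why the struct-of-arrays result appears.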
Recommended answer
You can do it using to_json and from_json: serialize the array column to JSON, then parse it back with a new DataType for the array of structs that lists only the fields you want to keep. Fields missing from the new schema are dropped during parsing:
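import org.apache.spark.sql.functions.{col, from_json, to_json}
import spark.implicits._ // enables the $"..." column syntax (already in scope in spark-shell)

// Target type: only the fields to keep, at both nesting levels
val newArrayType = ArrayType(
  new StructType()
    .add("ID", StringType)
    .add("Items", ArrayType(
      new StructType()
        .add("ItemTrainingArea", StringType)
    ))
)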
val jsonFieldName = "ArmyOrNavy_json"
val transformedDF = df.withColumn(jsonFieldName, to_json($"ArmyOrNavy"))
  .withColumn("ArmyOrNavy", from_json(col(jsonFieldName), newArrayType))
  .drop(jsonFieldName)
transformedDF.printSchema()
transformedDF.show(truncate = false)
// output
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
+--------+----------------------------------------------------------------------+
|Category|ArmyOrNavy |
+--------+----------------------------------------------------------------------+
|Army |[[1, [[Station], [Barracks]]], [2, [[Seige factory], [Tank Factory]]]]|
|Navy |[[3, [[Cruise Lines], [Yard]]], [4, [[Dock], [Hub]]]] |
+--------+----------------------------------------------------------------------+
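As an alternative to the JSON round trip, the same projection can probably be written with the `transform` higher-order function, rebuilding each struct with only the wanted fields. This is not part of the answer above. It is a minimal sketch that assumes Spark 3.0 or later (where `transform` is available in the Scala `functions` API); the local SparkSession, the one-row sample data, and the names `sample`, `unit`, `item` and `projected` are illustrative assumptions.

```scala
import org.apache.spark.sql.{Column, Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, transform}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("nested-select-sketch").getOrCreate()

// Same schema as in the question.
val itemType = new StructType().add("ItemName", StringType).add("ItemTrainingArea", StringType)
val schema = new StructType()
  .add("Category", StringType)
  .add("ArmyOrNavy", ArrayType(new StructType()
    .add("ID", StringType)
    .add("Type", StringType)
    .add("Count", StringType)
    .add("Items", ArrayType(itemType))))

// One row of the question's data, enough to exercise both nesting levels.
val rows = Seq(
  Row("Army", List(Row("1", "Infantry", "100", List(Row("Gun", "Station"), Row("Bazooka", "Barracks")))))
)
val sample = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Rebuild every outer struct with ID and a rebuilt Items array;
// every inner struct keeps only ItemTrainingArea.
val projected = sample.withColumn(
  "ArmyOrNavy",
  transform(col("ArmyOrNavy"), (unit: Column) =>
    struct(
      unit.getField("ID").as("ID"),
      transform(unit.getField("Items"), (item: Column) =>
        struct(item.getField("ItemTrainingArea").as("ItemTrainingArea"))
      ).as("Items")
    )
  )
)
projected.printSchema()
```

Compared with to_json/from_json, this variant avoids serializing the column to a string and back, at the cost of naming each kept field explicitly. The nullability flags in the printed schema may differ from those shown above.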