问题描述
具有以下模式:
root
|-- Elems: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Elem: integer (nullable = true)
| | |-- Desc: string (nullable = true)
如何添加新字段这样吗?
root
|-- Elems: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- New_field: integer (nullable = true)
| | |-- Elem: integer (nullable = true)
| | |-- Desc: string (nullable = true)
我已经用一个简单的结构完成了(有关更多详细信息,请参见本文的底部),但我无法使用结构体数组来实现。
I've already done that with a simple struct (more detail at the bottom of this post), but I'm not able to do it with an array of struct.
这是测试代码:
val schema = new StructType()
.add("Elems", ArrayType(new StructType()
.add("Elem", IntegerType)
.add("Desc", StringType)
))
val dataDS = Seq("""
{
"Elems": [ {"Elem":1, "Desc": "d1"}, {"Elem":2, "Desc": "d2"}, {"Elem":3, "Desc": "d3"} ]
}
""").toDS()
val df = spark.read.schema(schema).json(dataDS.rdd)
df.show(false)
+---------------------------+
|Elems |
+---------------------------+
|[[1, d1], [2, d2], [3, d3]]|
+---------------------------+
一旦有了DF,我最好的方法就是为每个元素创建数组结构:
Once we have the DF, the best approach I have is creating a Struct of arrays for each element:
val mod_df = df.withColumn("modif_elems",
struct(
array(lit("")).as("New_field"),
col("Elems.Elem"),
col("Elems.Desc")
))
mod_df.show(false)
+---------------------------+-----------------------------+
|Elems |modif_elems |
+---------------------------+-----------------------------+
|[[1, d1], [2, d2], [3, d3]]|[[], [1, 2, 3], [d1, d2, d3]]|
+---------------------------+-----------------------------+
mod_df.printSchema
root
|-- Elems: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Elem: integer (nullable = true)
| | |-- Desc: string (nullable = true)
|-- modif_elems: struct (nullable = false)
| |-- New_field: array (nullable = false)
| | |-- element: string (containsNull = false)
| |-- Elem: array (nullable = true)
| | |-- element: integer (containsNull = true)
| |-- Desc: array (nullable = true)
| | |-- element: string (containsNull = true)
我们不会丢失任何数据,但这是
We don't lose any data but this is not exactly what I want.
更新:PD1中的解决方法。
Update: Workaround in PD1.
代码几乎相同但是现在我们没有结构体数组,因此修改结构体会更容易:
The code is almost the same but now we don't have an array of struct, so it's easier to modify the struct:
val schema = new StructType()
.add("Elems", new StructType()
.add("Elem", IntegerType)
.add("Desc", StringType)
)
val dataDS = Seq("""
{
"Elems": {"Elem":1, "Desc": "d1"}
}
""").toDS()
val df = spark.read.schema(schema).json(dataDS.rdd)
df.show(false)
+-------+
|Elems |
+-------+
|[1, d1]|
+-------+
df.printSchema
root
|-- Elems: struct (nullable = true)
| |-- Elem: integer (nullable = true)
| |-- Desc: string (nullable = true)
在这种情况下,为了添加字段我们需要创建另一个结构:
In this case, in order to add the field we need to create another struct:
val mod_df = df
.withColumn("modif_elems",
struct(
lit("").alias("New_field"),
col("Elems.Elem"),
col("Elems.Desc")
)
)
mod_df.show
+-------+-----------+
| Elems|modif_elems|
+-------+-----------+
|[1, d1]| [, 1, d1]|
+-------+-----------+
mod_df.printSchema
root
|-- Elems: struct (nullable = true)
| |-- Elem: integer (nullable = true)
| |-- Desc: string (nullable = true)
|-- modif_elems: struct (nullable = false)
| |-- New_field: string (nullable = false)
| |-- Elem: integer (nullable = true)
| |-- Desc: string (nullable = true)
PD1:
好,我用过 Spark SQL函数(2.4.0版本中的新功能),几乎是我想要的,但是我看不到如何更改元素名称( as 或 alias 在这里不起作用):
PD1:
Ok, I have used arrays_zip Spark SQL function (new in 2.4.0 version) and it's nearly what I want but I can't see how we can change the elements names (as or alias doesn't work here):
val mod_df = df.withColumn("modif_elems",
arrays_zip(
array(lit("")).as("New_field"),
col("Elems.Elem").as("Elem"),
col("Elems.Desc").alias("Desc")
)
)
mod_df.show(false)
+---------------------------+---------------------------------+
|Elems |modif_elems |
+---------------------------+---------------------------------+
|[[1, d1], [2, d2], [3, d3]]|[[, 1, d1], [, 2, d2], [, 3, d3]]|
+---------------------------+---------------------------------+
mod_df.printSchema
root
|-- Elems: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Elem: integer (nullable = true)
| | |-- Desc: string (nullable = true)
|-- modif_elems: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- 0: string (nullable = true)
| | |-- 1: integer (nullable = true)
| | |-- 2: string (nullable = true)
结构 modif_elems 应该包含3个名为 New_field , Elem 和 Desc 的元素,而不是 0 , 1 和 2 。
Struct modif_elems shoud contains 3 elements named New_field, Elem and Desc, not 0, 1 and 2.
推荐答案
解决方案此处。我们需要使用arrays_zip,然后重命名获得的列:
Solution here. We need to do use arrays_zip and then rename the obtained column:
val mod_df = df
.withColumn("modif_elems_NOT_renamed",
arrays_zip(
array(lit("")).as("New_field"),
col("Elems.Elem").as("ElemRenamed"),
col("Elems.Desc").alias("DescRenamed")
))
.withColumn("modif_elems_renamed",
$"modif_elems_NOT_renamed".cast(ArrayType(elem_struct_recomposed)))
mod_df.show(false)
mod_df.printSchema
+---------------------------+---------------------------------+---------------------------------+
|Elems |modif_elems_NOT_renamed |modif_elems_renamed |
+---------------------------+---------------------------------+---------------------------------+
|[[1, d1], [2, d2], [3, d3]]|[[, 1, d1], [, 2, d2], [, 3, d3]]|[[, 1, d1], [, 2, d2], [, 3, d3]]|
+---------------------------+---------------------------------+---------------------------------+
root
|-- Elems: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Elem: integer (nullable = true)
| | |-- Desc: string (nullable = true)
|-- modif_elems_NOT_renamed: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- 0: string (nullable = true)
| | |-- 1: integer (nullable = true)
| | |-- 2: string (nullable = true)
|-- modif_elems_renamed: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- New_field: string (nullable = true)
| | |-- ElemRenamed: integer (nullable = true)
| | |-- DescRenamed: string (nullable = true)
这篇关于Spark-如何将元素添加到结构数组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!