问题描述
给定两个案例类:
case class Response(
responseField: String
...
items: List[Item])
case class Item(
itemField: String
...)
我正在创建一个 Response
数据集:
I am creating a Response
dataset:
val dataset = spark.read.format("parquet")
.load(inputPath)
.as[Response]
.map(x => x)
当 itemField
不存在于任何行中时会出现问题,并且 spark 将引发此错误 org.apache.spark.sql.AnalysisException: No such struct field itemField
.如果 itemField
没有嵌套,我可以通过执行 dataset.withColumn("itemField", lit(""))
来处理它.是否可以在 List
字段中执行相同的操作?
The issue arises when itemField
is not present in any of the rows and spark will raise this error org.apache.spark.sql.AnalysisException: No such struct field itemField
. If itemField
was not nested I could handle it by doing dataset.withColumn("itemField", lit(""))
. Is it possible to do the same within the List
field?
推荐答案
我假设如下:
数据是用以下架构写入的:
Data was written with the following schema:
case class Item(itemField: String)
case class Response(responseField: String, items: List[Item])
Seq(Response("a", List()), Response("b", List())).toDF.write.parquet("/tmp/structTest")
现在架构更改为:
case class Item(itemField: String, newField: Int)
case class Response(responseField: String, items: List[Item])
spark.read.parquet("/tmp/structTest").as[Response].map(x => x) // Fails
对于 Spark 2.4,请参阅:Spark - 如何向数组添加元素结构体
For Spark 2.4 please see:Spark - How to add an element to an array of structs
对于 Spark 2.3,这应该有效:
For Spark 2.3 this should work:
val addNewField: (Array[String], Array[Int]) => Array[Item] = (itemFields, newFields) => itemFields.zip(newFields).map { case (i, n) => Item(i, n) }
val addNewFieldUdf = udf(addNewField)
spark.read.parquet("/tmp/structTest")
.withColumn("items", addNewFieldUdf(
col("items.itemField") as "itemField",
array(lit(1)) as "newField"
)).as[Response].map(x => x) // Works
这篇关于如何处理spark中丢失的嵌套字段?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!