如何处理spark中丢失的嵌套字段

如何处理spark中丢失的嵌套字段

本文介绍了如何处理spark中丢失的嵌套字段?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定两个案例类:

case class Response(
  responseField: String
  ...
  items: List[Item])

case class Item(
  itemField: String
  ...)

我正在创建一个 Response 数据集:

I am creating a Response dataset:

val dataset = spark.read.format("parquet")
                .load(inputPath)
                .as[Response]
                .map(x => x)

itemField 不存在于任何行中时会出现问题,并且 spark 将引发此错误 org.apache.spark.sql.AnalysisException: No such struct field itemField.如果 itemField 没有嵌套,我可以通过执行 dataset.withColumn("itemField", lit("")) 来处理它.是否可以在 List 字段中执行相同的操作?

The issue arises when itemField is not present in any of the rows and spark will raise this error org.apache.spark.sql.AnalysisException: No such struct field itemField. If itemField was not nested I could handle it by doing dataset.withColumn("itemField", lit("")). Is it possible to do the same within the List field?

推荐答案

我假设如下:

数据是用以下架构写入的:

Data was written with the following schema:

case class Item(itemField: String)
case class Response(responseField: String, items: List[Item])
Seq(Response("a", List()), Response("b", List())).toDF.write.parquet("/tmp/structTest")

现在架构更改为:

case class Item(itemField: String, newField: Int)
case class Response(responseField: String, items: List[Item])
spark.read.parquet("/tmp/structTest").as[Response].map(x => x) // Fails

对于 Spark 2.4,请参阅:Spark - 如何向数组添加元素结构体

For Spark 2.4 please see:Spark - How to add an element to an array of structs

对于 Spark 2.3,这应该有效:

For Spark 2.3 this should work:

val addNewField: (Array[String], Array[Int]) => Array[Item] = (itemFields, newFields) => itemFields.zip(newFields).map { case (i, n) => Item(i, n) }

val addNewFieldUdf = udf(addNewField)
spark.read.parquet("/tmp/structTest")
   .withColumn("items", addNewFieldUdf(
      col("items.itemField") as "itemField",
      array(lit(1)) as "newField"
   )).as[Response].map(x => x) // Works

这篇关于如何处理spark中丢失的嵌套字段?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-21 13:56