Spark can't write then read JSON data with a null column

This article describes how to handle the problem of Spark being unable to write and then read back JSON data when a column contains only nulls; it may be a useful reference if you hit the same issue.

Problem description

I'm trying to set up Spark for a new project, and I have some case classes generated from schemas elsewhere in my company that I want to use as a template to read/write in a variety of formats (Parquet and JSON).

I'm noticing an issue in JSON with one of our fields, which is an Option[String]. The corresponding data is usually null, but is sometimes not. When I'm testing with subsets of this data, there's a decent chance that all the rows have null in this column. Spark seems to detect that and leaves the column out for any row that has a null for this data.

When I'm reading, as long as any single row has the corresponding data, Spark picks up the schema and can translate it back to the case class just fine. But if none of them are there, Spark sees a missing column and fails.

Here's some code that demonstrates this:

import org.apache.spark.sql.SparkSession

object TestNulls {
  case class Test(str: Option[String])
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .getOrCreate()
    import spark.implicits._

    val dataset = Seq(
      Test(None),
      Test(None),
      Test(None)
    ).toDS()

    // Because all rows are null, writes {} for all rows
    dataset.write.json("testpath")

    // Fails because column `str` does not exist in the inferred (empty) schema, even though it is an Option
    spark.read.json("testpath").as[Test].show()
  }
}
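
For reference (and as the inline comment above notes), every row is written as an empty JSON object, so each part file under testpath looks something like this:

{}
{}
{}

On read, schema inference over these files finds no fields at all, which is why the cast to Test fails.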

Is there a way to tell Spark not to fail on a missing nullable column? Failing that, is there a human-readable format I can use that won't exhibit this behavior? The JSON is mainly so that we can write human-readable files for testing and local development cases.

Recommended answer

You can use the case class to extract the schema from the Encoder and then pass it when you read:

import org.apache.spark.sql.Encoder  // spark.implicits._ must also be in scope, as in the question's code

val schema = implicitly[Encoder[Test]].schema
spark.read.schema(schema).json("testpath").as[Test].show()
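
For completeness, here is a minimal end-to-end sketch of that fix applied to the question's program. It uses Encoders.product[Test], an equivalent way to obtain the same schema without summoning the implicit encoder; the object name and output path are just placeholders:

import org.apache.spark.sql.{Encoders, SparkSession}

object TestNullsWithSchema {
  case class Test(str: Option[String])

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .getOrCreate()
    import spark.implicits._

    // Same all-null dataset as before; the JSON writer emits {} for every row
    Seq(Test(None), Test(None), Test(None)).toDS()
      .write.mode("overwrite").json("testpath")

    // Derive the schema from the case class instead of inferring it from the files
    val schema = Encoders.product[Test].schema

    // With an explicit schema, the missing `str` column is read as null
    // for every row, so the cast back to Test succeeds
    spark.read.schema(schema).json("testpath").as[Test].show()
  }
}

Because the reader never has to infer anything from the data, the behavior no longer depends on whether any row happened to have a non-null value in the column.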
