This article describes how to replace null, NaN, or infinite values with a default value in Spark Scala. It may be a useful reference for anyone hitting the same problem.

Problem description

I'm reading CSVs into Spark and setting the schema to all DecimalType(10,0) columns. When I query the data, I get the following error:

NumberFormatException: Infinite or NaN

If I have NaN/null/infinite values in my dataframe, I would like to set them to 0. How do I do this? This is how I'm attempting to load the data:

var cases = spark.read.
  option("header", false).
  option("nanValue", "0").
  option("nullValue", "0").
  option("positiveInf", "0").
  option("negativeInf", "0").
  schema(schema).
  csv(...

Any help would be greatly appreciated.

Recommended answer

If you have NaN values in multiple columns, you can use na.fill() to replace them with a default value.

Example:

  import org.apache.spark.sql.SparkSession

  val spark =
    SparkSession.builder().master("local").appName("test").getOrCreate()

  import spark.implicits._

  val data = spark.sparkContext.parallelize(
    Seq((0f, 0f, "2016-01-1"),
        (1f, 1f, "2016-02-2"),
        (2f, 2f, "2016-03-21"),
        (Float.NaN, Float.NaN, "2016-04-25"),
        (4f, 4f, "2016-05-21"),
        (Float.NaN, Float.NaN, "2016-06-1"),
        (6f, 6f, "2016-03-21"))
  ).toDF("id1", "id", "date")

data.na.fill(0).show
+---+---+----------+
|id1| id|      date|
+---+---+----------+
|0.0|0.0| 2016-01-1|
|1.0|1.0| 2016-02-2|
|2.0|2.0|2016-03-21|
|0.0|0.0|2016-04-25|
|4.0|4.0|2016-05-21|
|0.0|0.0| 2016-06-1|
|6.0|6.0|2016-03-21|
+---+---+----------+
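Note that na.fill() covers nulls and NaN, but it does not touch +/-Infinity, which the question also asks about. One way to zero those out as well is to rewrite infinite values with a when/otherwise expression before calling fill. A minimal sketch, assuming a single float column (the DataFrame and the column name "v" here are hypothetical, not from the original question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object FillInfExample extends App {
  val spark =
    SparkSession.builder().master("local").appName("fill-inf").getOrCreate()
  import spark.implicits._

  // Hypothetical column "v" containing a NaN and both infinities.
  val df = Seq(1f, Float.NaN, Float.PositiveInfinity, Float.NegativeInfinity, 4f)
    .toDF("v")

  val cleaned = df
    // Map +/-Infinity to 0 explicitly, since na.fill does not handle them.
    .withColumn(
      "v",
      when(col("v") === Float.PositiveInfinity || col("v") === Float.NegativeInfinity, 0f)
        .otherwise(col("v")))
    // na.fill(0) then takes care of null and NaN values.
    .na.fill(0)

  cleaned.show()
}
```

With a wide schema like the all-DecimalType one in the question, the same when/otherwise rewrite can be applied in a loop over df.columns before the final na.fill(0).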

