This article looks at the question "Why are columns changed to nullable in Apache Spark SQL?" and walks through the recommended answer.

Problem Description

Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame?

val myDf = Seq((2,"A"),(2,"B"),(1,"C"))
         .toDF("foo","bar")
         .withColumn("foo", 'foo.cast("Int"))

myDf.withColumn("foo_2", when($"foo" === 2, 1).otherwise(0)).select("foo", "foo_2").show

When df.printSchema is called at this point, nullable is false for both columns.

// fooMap is defined before foo so the lambda can reference it without a forward reference.
val fooMap = Map(
  1 -> "small",
  2 -> "big"
)

// Map an Int key to its label, falling back to "notFound".
val foo: (Int => String) = (t: Int) => {
  fooMap.get(t) match {
    case Some(tt) => tt
    case None     => "notFound"
  }
}

val fooUDF = udf(foo)

myDf
  .withColumn("foo", fooUDF(col("foo")))
  .withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
  .select("foo", "foo_2")
  .printSchema

However, nullable is now true for at least one column that was false before. How can this be explained?

Recommended Answer

When creating a Dataset from a statically typed structure (without depending on the schema argument), Spark uses a relatively simple set of rules to determine the nullable property:

  • If an object of the given type can be null, then its DataFrame representation is nullable.
  • If the object is an Option[_], then its DataFrame representation is nullable, with None considered to be SQL NULL.
  • In any other case it will be marked as not nullable (a short sketch after this list illustrates these rules).
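
As a rough illustration of these rules (a minimal sketch that is not part of the original answer; it assumes a Spark shell or a SparkSession with import spark.implicits._ in scope, and the case class Record is a made-up example), an Int field maps to a non-nullable column, while String and Option[_] fields map to nullable columns:

case class Record(id: Int, label: String, score: Option[Double])

val ds = Seq(
  Record(1, "a", Some(1.0)),
  Record(2, "b", None)   // None becomes SQL NULL
).toDS()

ds.printSchema()
// root
//  |-- id: integer (nullable = false)    <- scala.Int cannot be null
//  |-- label: string (nullable = true)   <- java.lang.String can be null
//  |-- score: double (nullable = true)   <- Option[_] with None as NULL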

Since Scala's String is java.lang.String, which can be null, the generated column is nullable. For the same reason, the bar column is nullable in the initial dataset:

val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
val df1 = data1.toDF("foo", "bar")
df1.schema("bar").nullable
Boolean = true

but foo is not (scala.Int cannot be null):

df1.schema("foo").nullable
Boolean = false

If we change the data definition to:

val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))

foo will be nullable (Integer is java.lang.Integer, and a boxed integer can be null):

data2.toDF("foo", "bar").schema("foo").nullable
Boolean = true
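
The same holds if the value is modelled as an Option[Int]: by the rules above the column becomes nullable, with None stored as SQL NULL (a small additional sketch, not part of the original answer):

val data3 = Seq[(Option[Int], String)]((Some(2), "A"), (None, "B"), (Some(1), "C"))
data3.toDF("foo", "bar").schema("foo").nullable
Boolean = true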

See also: SPARK-20668 Modify ScalaUDF to handle nullability.
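
This also explains the behaviour in the question: the UDF returns a Scala String, which is a reference type and can be null, so the column it produces is marked nullable. As a hedged sketch (assuming Spark 2.3 or later, where UserDefinedFunction.asNonNullable() is available; fooUDF2 and fooUDFNonNull are made-up names, and foo is the function from the question):

import org.apache.spark.sql.functions.udf

// foo returns String, so the column produced by udf(foo) is nullable by default.
val fooUDF2 = udf(foo)

// Spark 2.3+ (assumption): asNonNullable() declares that the UDF never returns
// null, so columns it produces are marked nullable = false.
val fooUDFNonNull = udf(foo).asNonNullable()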
