Problem Description
Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame?
val myDf = Seq((2, "A"), (2, "B"), (1, "C"))
  .toDF("foo", "bar")
  .withColumn("foo", 'foo.cast("Int"))

myDf.withColumn("foo_2", when($"foo" === 2, 1).otherwise(0)).select("foo", "foo_2").show
When printSchema is called on this DataFrame now, nullable is false for both columns.
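For reference, this is a sketch of the printSchema output I would expect at this point (on a recent Spark version):

root
 |-- foo: integer (nullable = false)
 |-- foo_2: integer (nullable = false)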
// Define the map first so the snippet also works when pasted line by line into the REPL
val fooMap = Map(
  1 -> "small",
  2 -> "big"
)

val foo: (Int => String) = (t: Int) => {
  fooMap.get(t) match {
    case Some(tt) => tt
    case None => "notFound"
  }
}

val fooUDF = udf(foo)

myDf
  .withColumn("foo", fooUDF(col("foo")))
  .withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
  .select("foo", "foo_2")
  .printSchema
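For reference, a sketch of the output I would expect here (on a recent Spark version):

root
 |-- foo: string (nullable = true)
 |-- foo_2: integer (nullable = false)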
However, nullable is now true for at least one column, where it was false before. How can this be explained?
Recommended Answer
When creating a Dataset from a statically typed structure (without depending on the schema argument), Spark uses a relatively simple set of rules to determine the nullable property.
- If an object of the given type can be null, then its DataFrame representation is nullable.
- If the object is an Option[_], then its DataFrame representation is nullable, with None considered to be SQL NULL (see the sketch after this list).
- In any other case it will be marked as not nullable.
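As a minimal sketch of the Option[_] rule (assuming a spark-shell style session with spark.implicits._ in scope, as in the snippets above):

// An Option[Int] field is encoded as a nullable integer column; None becomes SQL NULL
val optDf = Seq((Some(1), "a"), (None, "b")).toDF("num", "label")
optDf.schema("num").nullable
Boolean = true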
Since Scala's String is java.lang.String, which can be null, the generated column is nullable. For the same reason, the bar column is nullable in the initial dataset:
val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
val df1 = data1.toDF("foo", "bar")
df1.schema("bar").nullable
Boolean = true
but foo is not (scala.Int cannot be null):
df1.schema("foo").nullable
Boolean = false
If we change the data definition to:
val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))
foo will be nullable (Integer is java.lang.Integer and a boxed integer can be null):
data2.toDF("foo", "bar").schema("foo").nullable
Boolean = true
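Note that nullable is schema metadata derived from the static type, not from the data itself: data2 contains no nulls at all, yet the flag is true. As a quick sketch, even dropping null rows does not flip it:

// na.drop removes rows containing nulls but leaves the schema metadata untouched
data2.toDF("foo", "bar").na.drop().schema("foo").nullable
Boolean = true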
See also: SPARK-20668 Modify ScalaUDF to handle nullability.
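Tying this back to the question: fooUDF returns String, which can be null, so the resulting foo column is marked nullable regardless of the values the UDF actually produces. As a hedged sketch, on Spark 2.3+ (where, if I recall correctly following the ticket above, UserDefinedFunction gained asNonNullable) you can assert that a UDF never returns null:

// The UDF's return type is String, which can be null, so the column is nullable
myDf.withColumn("foo", fooUDF(col("foo"))).schema("foo").nullable
Boolean = true

// Spark 2.3+ (assumed): declare that the UDF never actually returns null
val fooUDFNonNull = udf(foo).asNonNullable()
myDf.withColumn("foo", fooUDFNonNull(col("foo"))).schema("foo").nullable
Boolean = false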