Question
Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame?
val myDf = Seq((2, "A"), (2, "B"), (1, "C"))
  .toDF("foo", "bar")
  .withColumn("foo", 'foo.cast("Int"))

myDf.withColumn("foo_2", when($"foo" === 2, 1).otherwise(0)).select("foo", "foo_2").show
When df.printSchema is called now, nullable will be false for both columns.
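Based on that description, the printed schema would look like this (a sketch, assuming a spark-shell session like the snippets above):

myDf.withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
  .select("foo", "foo_2")
  .printSchema
// root
//  |-- foo: integer (nullable = false)
//  |-- foo_2: integer (nullable = false)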
// fooMap is defined before foo so the snippet can be pasted line by line.
val fooMap = Map(
  1 -> "small",
  2 -> "big"
)

val foo: (Int => String) = (t: Int) => {
  fooMap.get(t) match {
    case Some(tt) => tt
    case None => "notFound"
  }
}

val fooUDF = udf(foo)

myDf
  .withColumn("foo", fooUDF(col("foo")))
  .withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
  .select("foo", "foo_2")
  .printSchema
However, now nullable is true for at least one column which was false before. How can this be explained?
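For reference, the output of the printSchema call above would look roughly like this; foo is now a string column produced by the UDF:

root
 |-- foo: string (nullable = true)
 |-- foo_2: integer (nullable = false)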
Answer
When creating a Dataset from a statically typed structure (without depending on the schema argument), Spark uses a relatively simple set of rules to determine the nullable property.
- If an object of the given type can be null, then its DataFrame representation is nullable.
- If an object is an Option[_], then its DataFrame representation is nullable, with None considered to be SQL NULL.
- In any other case it is marked as not nullable.
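A minimal sketch covering all three rules, assuming a spark-shell session (implicits in scope) like the other snippets here; the case class Record is hypothetical:

case class Record(s: String, o: Option[Int], i: Int)

Seq(Record("x", Some(1), 2)).toDS().printSchema
// root
//  |-- s: string (nullable = true)    <- rule 1: String can be null
//  |-- o: integer (nullable = true)   <- rule 2: Option[_]; None becomes SQL NULL
//  |-- i: integer (nullable = false)  <- rule 3: Int can never be null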
Since Scala String is java.lang.String, which can be null, the generated column is nullable. For the same reason the bar column is nullable in the initial dataset:
val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
val df1 = data1.toDF("foo", "bar")
df1.schema("bar").nullable
Boolean = true
but foo is not (scala.Int cannot be null):
df1.schema("foo").nullable
Boolean = false
If we change the data definition to:
val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))
foo will be nullable (Integer is java.lang.Integer, and a boxed integer can be null):
data2.toDF("foo", "bar").schema("foo").nullable
Boolean = true
See also: SPARK-20668 Modify ScalaUDF to handle nullability.
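Applying these rules back to the question's UDF: fooUDF returns a Scala String, which can be null, so the column it produces is marked nullable. A quick check, reusing myDf and fooUDF from the question:

myDf.withColumn("foo", fooUDF(col("foo"))).schema("foo").nullable
// Boolean = true -- the UDF returns String, which can be null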