This article walks through the question "SparkSQL: how to deal with null values in a user-defined function?" and its recommended answer.

Problem description

Given Table 1 with one column "x" of type String, I want to create Table 2 with a column "y" that is an integer representation of the date strings given in "x".

It is essential to keep the null values in column "y".

Table 1 (Dataframe df1):

+----------+
|         x|
+----------+
|2015-09-12|
|2015-09-13|
|      null|
|      null|
+----------+
root
 |-- x: string (nullable = true)

Table 2 (Dataframe df2):

+----------+--------+
|         x|       y|
+----------+--------+
|      null|    null|
|      null|    null|
|2015-09-12|20150912|
|2015-09-13|20150913|
+----------+--------+
root
 |-- x: string (nullable = true)
 |-- y: integer (nullable = true)

The user-defined function (udf) that converts values from column "x" into those of column "y" is:

val extractDateAsInt = udf[Int, String] (
  (d: String) => d.substring(0, 10)
    .filterNot("-".toSet)
    .toInt
)

It works, but dealing with null values is not possible.
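For illustration, a minimal sketch of the failure (assuming the df1 above and import spark.implicits._ in scope for the $ column syntax):

// The null rows make the task fail: the udf receives null,
// and d.substring(0, 10) throws a NullPointerException.
df1.withColumn("y", extractDateAsInt($"x")).show()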

Although I can do something like

val extractDateAsIntWithNull = udf[Int, String] (
  (d: String) =>
    if (d != null) d.substring(0, 10).filterNot("-".toSet).toInt
    else 1  // sentinel value standing in for null
)

I have found no way to "produce" null values via udfs (of course, as Ints cannot be null).

My current solution for the creation of df2 (Table 2) is as follows:

// holds data of table 1
val df1 = ...

// filter entries from df1 that are not null
val dfNotNulls = df1.filter(df1("x").isNotNull)
  .withColumn("y", extractDateAsInt(df1("x")))
  .withColumnRenamed("x", "right_x")

// create df2 via a left outer join of df1 and dfNotNulls on x
val df2 = df1.join(dfNotNulls, df1("x") === dfNotNulls("right_x"), "leftouter").drop("right_x")

Questions:

  • The current solution seems cumbersome (and probably not efficient with respect to performance). Is there a better way?
  • @Spark-developers: Is there a type NullableInt planned / available, such that the following udf is possible (see the code excerpt below)?

Code excerpt

// hypothetical -- does not compile, since there is no NullableInt type
val extractDateAsNullableInt = udf[NullableInt, String] (
  (d: String) =>
    if (d != null) d.substring(0, 10).filterNot("-".toSet).toInt
    else null
)

Recommended answer

This is where Option comes in handy:

val extractDateAsOptionInt = udf((d: String) => d match {
  case null => None
  case s => Some(s.substring(0, 10).filterNot("-".toSet).toInt)
})
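A quick usage sketch (assuming the df1 from the question and import spark.implicits._ in scope); because the udf returns Option[Int], Spark maps None to a SQL null and keeps the column nullable:

val df2 = df1.withColumn("y", extractDateAsOptionInt($"x"))
df2.printSchema()
// root
//  |-- x: string (nullable = true)
//  |-- y: integer (nullable = true)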

or, to make it slightly safer in the general case:

import scala.util.Try

val extractDateAsOptionInt = udf((d: String) => Try(
  d.substring(0, 10).filterNot("-".toSet).toInt
).toOption)
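A side effect of this variant is that malformed inputs also end up as null instead of failing the job; a hypothetical illustration (again assuming import spark.implicits._ in scope):

// "not-a-date" filters to "notadate", whose toInt throws,
// and Try turns that exception into None, i.e. null
Seq("2015-09-12", null, "not-a-date").toDF("x")
  .withColumn("y", extractDateAsOptionInt($"x"))
  .show()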

All credit goes to Dmitriy Selivanov, who pointed out this solution as a (missing?) edit here.

An alternative is to handle null outside the UDF:

import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.sql.types.IntegerType
// the $"..." column syntax below also needs: import spark.implicits._

val extractDateAsInt = udf(
   (d: String) => d.substring(0, 10).filterNot("-".toSet).toInt
)

df.withColumn("y",
  when($"x".isNull, lit(null))
    .otherwise(extractDateAsInt($"x"))
    .cast(IntegerType)
)
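Note that when evaluates its branches per row, so extractDateAsInt is never invoked on the null rows; the trailing cast(IntegerType) only pins down the column type, since lit(null) on its own carries no type.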
