I have a Spark data frame where one column is an array of integers. The column is nullable because it is coming from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.


val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )


However, this results in the following exception:

java.lang.RuntimeException: Unsupported literal type class [I [I@5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)

Apparently array types are not supported by the when function. Is there some other easy way to convert the null values?


In case it is relevant, here is the schema for this column:

|-- myCol: array (nullable = true)
|    |-- element: integer (containsNull = false)


You can use a UDF:

import org.apache.spark.sql.functions.{coalesce, udf, when}

val array_ = udf(() => Array.empty[Int])


df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show


Alternatively, without a UDF, you can use the array function:

import org.apache.spark.sql.functions.{array, coalesce, when}

df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show

Note that this works only if a cast from string to the desired element type is allowed (in Spark 2.x, array() with no arguments yields array<string>).
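
For completeness, here is a minimal end-to-end sketch of the array-based approach, assuming Spark 2.x or later running with a local SparkSession. The object name EmptyArrayDemo and the data frames left and right are made up for illustration; only array, coalesce, and the cast come from the answer above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, coalesce, col}

object EmptyArrayDemo {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-array-demo")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical inputs: only id 1 has a matching array on the right side
    val left = Seq(1, 2, 3).toDF("id")
    val right = Seq((1, Seq(10, 20))).toDF("id", "myCol")

    // The left outer join leaves myCol null for ids 2 and 3
    val joined = left.join(right, Seq("id"), "left_outer")

    // Replace nulls with an empty array cast to the column's element type
    val filled = joined.withColumn(
      "myCol",
      coalesce(col("myCol"), array().cast("array<integer>"))
    )

    filled.printSchema()
    filled.show()

    spark.stop()
  }
}
```

coalesce returns its first non-null argument, so rows that already hold an array are left unchanged and only the unmatched rows receive the empty array.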

The same can of course be done in PySpark. For older versions, you can define a UDF:

from pyspark.sql.functions import coalesce, udf
from pyspark.sql.types import ArrayType, IntegerType

def empty_array(t):
    # t is a DataType instance such as IntegerType(), so pass it straight to ArrayType
    return udf(lambda: [], ArrayType(t))()

coalesce(myCol, empty_array(IntegerType()))

In recent versions, just use array:

from pyspark.sql.functions import array, coalesce

coalesce(myCol, array().cast("array<integer>"))

