Problem description
I have a Spark data frame where one column is an array of integers. The column is nullable because it is coming from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.
I thought I could do it like this:
val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )
However, this results in the following exception:
java.lang.RuntimeException: Unsupported literal type class [I [I@5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)
Apparently array types are not supported by the when function. Is there some other easy way to convert the null values?
In case it is relevant, here is the schema for this column:
|-- myCol: array (nullable = true)
| |-- element: integer (containsNull = false)
Answer
You can use a UDF:
import org.apache.spark.sql.functions.{coalesce, udf, when}
val array_ = udf(() => Array.empty[Int])
combined with WHEN or COALESCE:
df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show
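Both expressions do the same thing: keep the column value unless it is null, in which case substitute an empty array. As a plain-Python sketch of the per-row semantics (not Spark code; the helper name is made up for illustration):

```python
def fill_null_array(value):
    # Mimics when(col.isNull, empty).otherwise(col) / coalesce(col, empty)
    # for a single row value: keep the value unless it is null (None).
    return [] if value is None else value

rows = [[1, 2], None, [3]]
print([fill_null_array(r) for r in rows])  # [[1, 2], [], [3]]
```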
In recent versions you can use the array function:
import org.apache.spark.sql.functions.{array, lit}
df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show
Please note that it will work only if conversion from string to the desired type is allowed.
The same thing can of course be done in PySpark as well. For the legacy solution you can define a udf:
from pyspark.sql.functions import coalesce, udf
from pyspark.sql.types import ArrayType, IntegerType
def empty_array(t):
    return udf(lambda: [], ArrayType(t()))()
coalesce(myCol, empty_array(IntegerType()))
and in recent versions just use array:
from pyspark.sql.functions import array, coalesce
coalesce(myCol, array().cast("array<integer>"))
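For intuition, coalesce simply returns its first non-null argument for each row. A minimal plain-Python model of that semantics (not the Spark implementation):

```python
def coalesce(*values):
    # Return the first argument that is not None, like SQL COALESCE;
    # returns None if every argument is None.
    return next((v for v in values if v is not None), None)

print(coalesce(None, []))    # []
print(coalesce([1, 2], []))  # [1, 2]
```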