Problem description
I have a Spark data frame where one column is an array of integers. The column is nullable because it is coming from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.
I thought I could do it like this:
val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )
However, this results in the following exception:
java.lang.RuntimeException: Unsupported literal type class [I [I@5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)
Apparently array types are not supported by the when function. Is there some other easy way to convert the null values?
In case it is relevant, here is the schema for this column:
|-- myCol: array (nullable = true)
| |-- element: integer (containsNull = false)
Answer
You can use a UDF:
import org.apache.spark.sql.functions.{coalesce, udf, when}
val array_ = udf(() => Array.empty[Int])
combined with WHEN or COALESCE:
df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show
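Both expressions do the same thing: keep the column value unless it is null, in which case substitute an empty array. As a plain-Python sketch of the per-row semantics (not Spark code; the helper name is made up for illustration):

```python
def fill_null_array(value):
    # Mimics when(col.isNull, empty).otherwise(col) / coalesce(col, empty)
    # for a single row value: keep the value unless it is null (None).
    return [] if value is None else value

rows = [[1, 2], None, [3]]
print([fill_null_array(r) for r in rows])  # [[1, 2], [], [3]]
```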
In recent versions you can use the array function:
import org.apache.spark.sql.functions.{array, lit}
df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show
Please note that it will work only if conversion from string to the desired type is allowed.
The same thing can of course be done in PySpark as well. For the legacy solution you can define a udf:
from pyspark.sql.functions import coalesce, udf
from pyspark.sql.types import ArrayType, IntegerType
def empty_array(t):
    return udf(lambda: [], ArrayType(t()))()
coalesce(myCol, empty_array(IntegerType()))
and in recent versions just use array:
from pyspark.sql.functions import array, coalesce
coalesce(myCol, array().cast("array<integer>"))
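For intuition, coalesce simply returns its first non-null argument for each row. A minimal plain-Python model of that semantics (not the Spark implementation):

```python
def coalesce(*values):
    # Return the first argument that is not None, like SQL COALESCE;
    # returns None if every argument is None.
    return next((v for v in values if v is not None), None)

print(coalesce(None, []))    # []
print(coalesce([1, 2], []))  # [1, 2]
```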