Converting null values to an empty array in a Spark DataFrame

Question

I have a Spark data frame where one column is an array of integers. The column is nullable because it is coming from a left outer join. I want to convert all null values to an empty array so I don't have to deal with nulls later.

I thought I could do it like this:

val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )

However, this results in the following exception:

java.lang.RuntimeException: Unsupported literal type class [I [I@5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)

Apparently array types are not supported by the when function. Is there some other easy way to convert the null values?

In case it is relevant, here is the schema for this column:

|-- myCol: array (nullable = true)
|    |-- element: integer (containsNull = false)

Recommended answer

You can use a UDF:

import org.apache.spark.sql.functions.udf

val array_ = udf(() => Array.empty[Int])

and combine it with when or coalesce:

df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show

In recent versions you can use the array function:

import org.apache.spark.sql.functions.{array, coalesce, when}

df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show

Please note that it will work only if conversion from string to the desired type is allowed.

The same thing can of course be done in PySpark as well. For the legacy solution you can define a udf:

from pyspark.sql.functions import coalesce, udf
from pyspark.sql.types import ArrayType, IntegerType

def empty_array(t):
    # t is a DataType instance such as IntegerType(), so it is passed through as-is
    return udf(lambda: [], ArrayType(t))()

coalesce(myCol, empty_array(IntegerType()))

and in the recent versions just use array:

from pyspark.sql.functions import array, coalesce

coalesce(myCol, array().cast("array<integer>"))

07-29 13:17