Problem description
I have a column in my Spark DataFrame:
 |-- topics_A: array (nullable = true)
 |    |-- element: string (containsNull = true)
I'm using CountVectorizer on it:
topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")
I get NullPointerExceptions, because the topics_A column sometimes contains null.
Is there a way around this? Filling it with a zero-length array would work OK (although it would blow out the data size quite a lot), but I can't work out how to do a fillna on an array column in PySpark.
Recommended answer
Personally I would drop the rows with NULL values, because there is no useful information there, but you can replace nulls with empty arrays. First some imports:
from pyspark.sql.functions import when, col, coalesce, array
You can define an empty array of specific type as:
fill = array().cast("array<string>")
and combine it with a when clause:
topics_a = when(col("topics_A").isNull(), fill).otherwise(col("topics_A"))
or with coalesce:
topics_a = coalesce(col("topics_A"), fill)
and use it as:
df.withColumn("topics_A", topics_a)
So with example data:
df = sc.parallelize([(1, ["a", "b"]), (2, None)]).toDF(["id", "topics_A"])
df_ = df.withColumn("topics_A", topics_a)
topic_vectorizer_A.fit(df_).transform(df_).show()
the result will be:
+---+--------+-------------------+
| id|topics_A|       topics_vec_A|
+---+--------+-------------------+
|  1|  [a, b]|(2,[0,1],[1.0,1.0])|
|  2|      []|          (2,[],[])|
+---+--------+-------------------+