In Spark, I have the following DataFrame named "df", which contains some null entries:

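+-------+--------------------+--------------------+
|     id|           features1|           features2|
+-------+--------------------+--------------------+
|    185|(5,[0,1,4],[0.1,0...|                null|
|    220|(5,[0,2,3],[0.1,0...|(10,[1,2,6],[0.1,...|
|    225|                null|(10,[1,3,5],[0.1,...|
+-------+--------------------+--------------------+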

df.features1 and df.features2 are of type vector (nullable). I then tried to fill the null entries with SparseVectors using the following code:
df1 = df.na.fill({"features1":SparseVector(5,{}), "features2":SparseVector(10, {})})

This attempt failed with the following error:

AttributeError: 'SparseVector' object has no attribute '_get_object_id'

I then found the following paragraph in the Spark documentation:
fillna(value, subset=None)
Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.

value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.

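Does this explain why I failed to replace the null entries in the DataFrame with SparseVectors? Or does it mean there is no way to do this in a DataFrame?

I can achieve my goal by converting the DataFrame to an RDD and replacing the None values with SparseVectors, but it would be much more convenient to do this directly in the DataFrame.

Is there any way to do this directly in a DataFrame?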
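
For reference, the RDD round-trip mentioned above could look roughly like the sketch below. The names `fill_nulls` and `fill_dataframe` are hypothetical helpers, not part of PySpark, and the sketch assumes an active SparkSession and unique column names:

```python
def fill_nulls(record, defaults):
    """Return a copy of the dict `record` with None values replaced from `defaults`."""
    return {
        key: (defaults.get(key) if value is None else value)
        for key, value in record.items()
    }


def fill_dataframe(df, defaults):
    """Rebuild `df` through its RDD, filling None cells column by column.

    Assumption: `df` is a PySpark DataFrame and `defaults` maps column names
    to replacement values (for example, empty SparseVectors).
    """
    columns = df.columns

    def fix(row):
        filled = fill_nulls(row.asDict(), defaults)
        # Emit a plain tuple in the original column order so the schema still fits.
        return tuple(filled[name] for name in columns)

    # Reuse the original schema so the vector columns keep their type.
    return df.rdd.map(fix).toDF(df.schema)
```

With the DataFrame above, this would be called with a mapping such as {"features1": SparseVector(5, {}), "features2": SparseVector(10, {})}. Columns without an entry in the mapping keep their nulls.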


You can use a udf:

from pyspark.sql.functions import udf, lit
from pyspark.ml.linalg import SparseVector, VectorUDT

# Return the vector unchanged, or an empty SparseVector of size i when it is null.
fill_with_vector = udf(
    lambda x, i: x if x is not None else SparseVector(i, {}),
    VectorUDT()
)

# sc is an existing SparkContext.
df = sc.parallelize([
    (SparseVector(5, {1: 1.0}), SparseVector(10, {1: -1.0})), (None, None)
]).toDF(["features1", "features2"])

(df
    .withColumn("features1", fill_with_vector("features1", lit(5)))
    .withColumn("features2", fill_with_vector("features2", lit(10)))
    .show())

# +-------------+---------------+
# |    features1|      features2|
# +-------------+---------------+
# |(5,[1],[1.0])|(10,[1],[-1.0])|
# |    (5,[],[])|     (10,[],[])|
# +-------------+---------------+

For more on replacing nulls with a SparseVector in a Python Spark DataFrame, see the similar question on Stack Overflow: https://stackoverflow.com/questions/41531108/
