PySpark KMeans clustering: IllegalArgumentException on the features column

Problem description

pyspark==2.4.0

This is the code that raises the exception:

LDA = spark.read.parquet('./LDA.parquet/')
LDA.printSchema()

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(featuresCol='topic_vector_fix_dim').setK(15).setSeed(1)
model = kmeans.fit(LDA)


root
|-- Id: string (nullable = true)
|-- topic_vector_fix_dim: array (nullable = true)
| |-- element: double (containsNull = true)

IllegalArgumentException: 'requirement failed: Column topic_vector_fix_dim must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type array<double>.'

I am confused: the check rejects my array<double> column, yet the message lists array<double> as one of the accepted types.
Each entry of topic_vector_fix_dim is a 1-D array of floats.

Recommended answer

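The array type of the features column must have containsNull set to False. The schema above shows containsNull = true for the element type, and the error message prints both variants as array<double>, which is why it looks contradictory. Rebuilding the column with a non-nullable element type fixes it: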

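from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
# Identity UDF whose declared return type is an array of non-nullable doubles
new_schema = ArrayType(DoubleType(), containsNull=False)
udf_foo = udf(lambda x: x, new_schema)
LDA = LDA.withColumn("topic_vector_fix_dim", udf_foo("topic_vector_fix_dim"))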

After that, everything works.
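To see why the message can name array<double> as both the accepted and the rejected type, here is a small stdlib-only toy model of such a check. It is not Spark's code: `ToyArrayType` and `check_column_type` are hypothetical names, and the sketch assumes the real check compares full types (including the containsNull flag) while the message prints only a short type name that omits the flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyArrayType:
    """Hypothetical stand-in for an array type; not Spark's real class."""
    element: str
    contains_null: bool = True

    def short_name(self) -> str:
        # The printed form leaves out contains_null (assumption about the message format).
        return f"array<{self.element}>"


def check_column_type(actual, accepted):
    """Raise ValueError unless `actual` equals one of the `accepted` types."""
    if actual not in accepted:  # dataclass equality also compares contains_null
        names = ", ".join(t.short_name() for t in accepted)
        raise ValueError(
            f"must be one of [{names}] but was actually {actual.short_name()}"
        )


accepted = [
    ToyArrayType("double", contains_null=False),
    ToyArrayType("float", contains_null=False),
]
nullable = ToyArrayType("double", contains_null=True)   # like the original column
fixed = ToyArrayType("double", contains_null=False)     # like the column after the UDF

try:
    check_column_type(nullable, accepted)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
check_column_type(fixed, accepted)  # no exception: the types are now equal
```

In this model the two types print identically but compare unequal, which mirrors the confusing error above. Under that assumption, declaring the UDF's return type with containsNull=False is what makes the column pass the check.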
