Problem Description
I have a parquet file containing the id and features columns, and I want to apply the PCA algorithm.
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val pca = new PCA()
.setInputCol("features")
.setK(50)
.fit(dataset)
.setOutputCol("pcaFeatures")
val result = pca.transform(dataset).select("pcaFeatures")
pca.save("/usr/local/spark/dataset/out")
But I get this exception.
Recommended Answer
Spark's PCA transformer needs a column created by a VectorAssembler. Here you create one but never use it. Also, the VectorAssembler only takes numbers as input; I don't know what the type of features is, but if it's an array, it won't work — transform it into numeric columns first. Finally, it is a bad idea to give the assembled column the same name as an original column: the VectorAssembler does not remove its input columns, so you would end up with two features columns.
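The advice above can be sketched for the asker's own data. This is a minimal sketch, assuming features is an array of doubles with a known fixed length (3 here); the length, the f0/f1/f2 names, and the assembledFeatures output name are illustrative assumptions, not taken from the original data:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

val raw = spark.read.parquet("/usr/local/spark/dataset/data/user")

// Assumption: features is an array of 3 doubles. Flatten it into
// separate numeric columns f0, f1, f2 so VectorAssembler can use them.
val n = 3
val flat = (0 until n).foldLeft(raw) { (df, i) =>
  df.withColumn(s"f$i", col("features").getItem(i).cast("double"))
}

// Assemble into a column whose name does NOT clash with an input column.
val assembler = new VectorAssembler()
  .setInputCols((0 until n).map(i => s"f$i").toArray)
  .setOutputCol("assembledFeatures")
val assembled = assembler.transform(flat)
```

If the array length varies per row, this fixed-index approach does not apply and the data would need to be regularized first.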
Here is a working example of PCA computation in Spark:
import org.apache.spark.ml.feature._
import spark.implicits._  // needed for the 'id symbol-to-column syntax

// Build a small all-numeric DataFrame: id, id^2, id^3
val df = spark.range(10)
  .select('id, ('id * 'id) as "id2", ('id * 'id * 'id) as "id3")

// Assemble the numeric columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("id", "id2", "id3")).setOutputCol("features")
val assembled_df = assembler.transform(df)

// Fit PCA on the assembled vectors, keeping 2 principal components
val pca = new PCA()
  .setInputCol("features").setOutputCol("pcaFeatures").setK(2)
  .fit(assembled_df)
val result = pca.transform(assembled_df)
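To mirror the original goal of selecting the PCA output and saving the result, the fitted model can then be inspected and persisted. The output path is the one from the question; using overwrite mode and reloading with PCAModel.load are illustrative choices, not part of the original code:

```scala
import org.apache.spark.ml.feature.PCAModel

// Keep only the PCA output column, as in the question's select.
val pcaOnly = result.select("pcaFeatures")
pcaOnly.show(truncate = false)

// pca here is the fitted PCAModel (the result of .fit), so saving it
// persists the learned principal components, not just the estimator.
pca.write.overwrite().save("/usr/local/spark/dataset/out")

// Later: reload the model and reuse it on new assembled data.
val reloaded = PCAModel.load("/usr/local/spark/dataset/out")
```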
This concludes this article on the IllegalArgumentException that occurs when computing PCA with Spark ML; we hope the recommended answer is helpful.