Problem Description
I have a parquet file containing the id and features columns, and I want to apply the PCA algorithm.
val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
val features = new VectorAssembler()
.setInputCols(Array("id", "features" ))
.setOutputCol("features")
val pca = new PCA()
.setInputCol("features")
.setK(50)
.fit(dataset)
.setOutputCol("pcaFeatures")
val result = pca.transform(dataset).select("pcaFeatures")
pca.save("/usr/local/spark/dataset/out")
But I get this exception.
Recommended Answer
Spark's PCA transformer needs a column created by a VectorAssembler. Here you create one but never use it. Also, the VectorAssembler only takes numbers as input; I don't know what the type of features is, but if it's an array, it won't work — transform it into numeric columns first. Finally, it is a bad idea to give the assembled column the same name as an original column: the VectorAssembler does not remove its input columns, so you would end up with two features columns.
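The advice above can be sketched for the asker's own data. This is a minimal sketch, assuming features is an array of doubles with a known fixed length (3 here); the length, the f0/f1/f2 names, and the assembledFeatures output name are illustrative assumptions, not taken from the original data:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

val raw = spark.read.parquet("/usr/local/spark/dataset/data/user")

// Assumption: features is an array of 3 doubles. Flatten it into
// separate numeric columns f0, f1, f2 so VectorAssembler can use them.
val n = 3
val flat = (0 until n).foldLeft(raw) { (df, i) =>
  df.withColumn(s"f$i", col("features").getItem(i).cast("double"))
}

// Assemble into a column whose name does NOT clash with an input column.
val assembler = new VectorAssembler()
  .setInputCols((0 until n).map(i => s"f$i").toArray)
  .setOutputCol("assembledFeatures")
val assembled = assembler.transform(flat)
```

If the array length varies per row, this fixed-index approach does not apply and the data would need to be regularized first.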
Here is a working example of PCA computation in Spark:
import org.apache.spark.ml.feature._
import spark.implicits._  // needed for the 'id symbol-to-column syntax

// Build a small all-numeric DataFrame: id, id^2, id^3
val df = spark.range(10)
  .select('id, ('id * 'id) as "id2", ('id * 'id * 'id) as "id3")

// Assemble the numeric columns into a single vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("id", "id2", "id3")).setOutputCol("features")
val assembled_df = assembler.transform(df)

// Fit PCA on the assembled vectors, keeping 2 principal components
val pca = new PCA()
  .setInputCol("features").setOutputCol("pcaFeatures").setK(2)
  .fit(assembled_df)
val result = pca.transform(assembled_df)
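To mirror the original goal of selecting the PCA output and saving the result, the fitted model can then be inspected and persisted. The output path is the one from the question; using overwrite mode and reloading with PCAModel.load are illustrative choices, not part of the original code:

```scala
import org.apache.spark.ml.feature.PCAModel

// Keep only the PCA output column, as in the question's select.
val pcaOnly = result.select("pcaFeatures")
pcaOnly.show(truncate = false)

// pca here is the fitted PCAModel (the result of .fit), so saving it
// persists the learned principal components, not just the estimator.
pca.write.overwrite().save("/usr/local/spark/dataset/out")

// Later: reload the model and reuse it on new assembled data.
val reloaded = PCAModel.load("/usr/local/spark/dataset/out")
```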
This concludes this article on the IllegalArgumentException that occurs when computing PCA with Spark ML; we hope the recommended answer is helpful.