Problem description
When I try to feed df2 to KMeans, I get the following error:
clusters = KMeans.train(df2, 10, maxIterations=30,
runs=10, initializationMode="random")
The error I get:
Cannot convert type <class 'pyspark.sql.types.Row'> into Vector
df2 is a DataFrame created as follows:
df = sqlContext.read.json("data/ALS3.json")
df2 = df.select('latitude','longitude')
df2.show()
latitude| longitude|
60.1643075| 24.9460844|
60.4686748| 22.2774728|
How can I convert these two columns to a Vector and feed them to KMeans?
Recommended answer
ML
The problem is that you missed the documentation's example, which makes it clear that KMeans requires a DataFrame with a Vector column as features, not a DataFrame of plain Row objects.
To modify your current data's structure, you can use a VectorAssembler. In your case it could be something like:
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col
vectorAssembler = VectorAssembler(inputCols=["latitude", "longitude"],
                                  outputCol="features")
# For your special case that has strings instead of doubles, you should cast them first.
expr = [col(c).cast("Double").alias(c)
        for c in vectorAssembler.getInputCols()]
df2 = df2.select(*expr)
df = vectorAssembler.transform(df2)
Besides, you should also normalize your features using the class MinMaxScaler to obtain better results.
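To make the normalization step concrete, here is a minimal plain-Python sketch of the arithmetic that min-max scaling performs on each column, mapping values to the range [0, 1]. It illustrates the idea only and is not Spark's implementation. The helper name min_max_scale is hypothetical, the sample rows are the two shown in the question, and using 0.5 for a constant column is an assumption made for this sketch.

```python
def min_max_scale(rows):
    """Rescale each column of `rows` (a list of equal-length numeric tuples) to [0, 1]."""
    columns = list(zip(*rows))              # transpose rows into columns
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled_row = []
        for value, low, high in zip(row, lows, highs):
            if high == low:
                # Constant column: no spread to rescale, so use the midpoint (assumption).
                scaled_row.append(0.5)
            else:
                scaled_row.append((value - low) / (high - low))
        scaled.append(scaled_row)
    return scaled


# The two (latitude, longitude) rows shown in the question.
points = [(60.1643075, 24.9460844), (60.4686748, 22.2774728)]
scaled_points = min_max_scale(points)
print(scaled_points)  # [[0.0, 1.0], [1.0, 0.0]]
```

With only two rows, every value lands on 0 or 1. On the full dataset the values spread across the interval, so latitude and longitude contribute on a comparable scale to the distances that KMeans computes.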
MLlib
To achieve this using MLlib, you first need to use a map function to convert all your string values into Double and merge them together in a DenseVector.
from pyspark.mllib.linalg import Vectors
rdd = df2.rdd.map(lambda data: Vectors.dense([float(c) for c in data]))
After this point, you can train your MLlib KMeans model using the rdd variable.
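The per-row conversion inside the map call can be tried without a Spark cluster. The sketch below applies the same list comprehension to plain tuples standing in for Row objects (a Row is a tuple subclass, so iterating over it yields its values). The helper name row_to_floats and the string-valued sample rows are assumptions for illustration.

```python
def row_to_floats(row):
    """Convert every value in a row-like sequence to float, as the lambda passed to map does."""
    return [float(value) for value in row]


# Stand-ins for Row objects whose columns were read as strings
# (hypothetical sample, using the coordinates shown in the question).
rows = [("60.1643075", "24.9460844"), ("60.4686748", "22.2774728")]
converted = [row_to_floats(row) for row in rows]
print(converted)
```

Note that float raises ValueError for a non-numeric string and TypeError for None, so rows with missing or malformed coordinates would need to be filtered out before the map step.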