How to convert Row type to Vector to feed to KMeans


Problem description

When I try to feed df2 to KMeans with the following call, I get an error:

clusters = KMeans.train(df2, 10, maxIterations=30,
                        runs=10, initializationMode="random")

The error I get:

Cannot convert type <class 'pyspark.sql.types.Row'> into Vector

df2 is a DataFrame created as follows:

df = sqlContext.read.json("data/ALS3.json")
df2 = df.select('latitude','longitude')

df2.show()


+----------+----------+
|  latitude| longitude|
+----------+----------+
|60.1643075|24.9460844|
|60.4686748|22.2774728|
+----------+----------+

How can I convert these two columns to a Vector and feed them to KMeans?

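Recommended answer

ML

The problem is that you missed the documentation's example: KMeans in the ML API is fitted on a DataFrame, and that DataFrame must contain a Vector column holding the features (named "features" by default). A DataFrame of plain Row objects with separate latitude and longitude columns does not satisfy that.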

To modify your current data's structure you can use a VectorAssembler. In your case it could be something like:

from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col

# Combine the two coordinate columns into a single Vector column.
vectorAssembler = VectorAssembler(inputCols=["latitude", "longitude"],
                                  outputCol="features")

# The columns here hold strings rather than doubles, so cast them first.
expr = [col(c).cast("double").alias(c)
        for c in vectorAssembler.getInputCols()]

df2 = df2.select(*expr)
df = vectorAssembler.transform(df2)

Besides, you should also normalize your features using the MinMaxScaler class to obtain better results.

To achieve this with MLlib, you first need to use a map function to convert all your string values to floats and merge them together into a DenseVector.

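from pyspark.mllib.linalg import Vectors

# Go through df2.rdd: DataFrame has no map method in Spark 2.0 and later.
rdd = df2.rdd.map(lambda data: Vectors.dense([float(c) for c in data]))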

After this point you can train your MLlib KMeans model using the rdd variable.
