This article describes how to compute cosine similarity between the rows of a Spark DataFrame.
Problem Description
I am using Spark Scala to calculate cosine similarity between DataFrame rows.
The DataFrame schema is as follows:
root
|-- SKU: double (nullable = true)
|-- Features: vector (nullable = true)
A sample of the DataFrame:
+-------+--------------------+
| SKU| Features|
+-------+--------------------+
| 9970.0|[4.7143,0.0,5.785...|
|19676.0|[5.5,0.0,6.4286,4...|
| 3296.0|[4.7143,1.4286,6....|
|13658.0|[6.2857,0.7143,4....|
| 1.0|[4.2308,0.7692,5....|
| 513.0|[3.0,0.0,4.9091,5...|
| 3753.0|[5.9231,0.0,4.846...|
|14967.0|[4.5833,0.8333,5....|
| 2803.0|[4.2308,0.0,4.846...|
|11879.0|[3.1429,0.0,4.5,4...|
+-------+--------------------+
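For reference, a DataFrame with this schema can be built as in the following minimal sketch. The feature values here are made up to match the truncated sample above, and inClusters is the variable name the code later in this post uses:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("CosineSimilarity").getOrCreate()
import spark.implicits._

// Made-up feature values, matching the (SKU, Features) schema shown above.
val inClusters = Seq(
  (9970.0,  Vectors.dense(4.7143, 0.0, 5.785, 5.0)),
  (19676.0, Vectors.dense(5.5, 0.0, 6.4286, 4.0)),
  (3296.0,  Vectors.dense(4.7143, 1.4286, 6.0, 5.0))
).toDF("SKU", "Features")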
I tried to transpose the matrix and checked the following links: Apache Spark Python Cosine Similarity over DataFrames and calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf, but I believe there is a better solution.
I tried the following sample code:
val irm = new IndexedRowMatrix(inClusters.rdd.map {
case (v,i:Vector) => IndexedRow(v, i)
}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities
But I got the following error:
Error:(80, 12) constructor cannot be instantiated to expected type;
found : (T1, T2)
required: org.apache.spark.sql.Row
case (v,i:Vector) => IndexedRow(v, i)
I checked the following link: Apache Spark: How to create a matrix from a DataFrame? But I can't do it using Scala.
Recommended Answer
- DataFrame.rdd returns RDD[Row], not RDD[(T, U)]. You have to pattern match the Row or directly extract the interesting parts.
- The ml Vector used with Datasets since Spark 2.0 is not the same as the mllib Vector used by the old API. You have to convert it to use it with IndexedRowMatrix.
- The index has to be a Long, not a string.
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  // Pattern match the Row and convert the ml Vector to its mllib counterpart.
  case Row(_, v: org.apache.spark.ml.linalg.Vector) =>
    org.apache.spark.mllib.linalg.Vectors.fromML(v)
}.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })
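From here, the row-to-row similarities can be obtained exactly as the question attempted: transpose the matrix so that each original row becomes a column, then call columnSimilarities. A short sketch (the variable name similarities is arbitrary):

// Transpose so each original row becomes a column, then compute pairwise
// cosine similarities between columns (i.e. between the original rows).
val similarities = irm.toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities()

// Each MatrixEntry(i, j, value) holds the cosine similarity between rows i and j.
similarities.entries.take(10).foreach(println)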