Problem description
I am using Spark Scala to calculate the cosine similarity between DataFrame rows.
The schema of the DataFrame is as follows:
root
|-- SKU: double (nullable = true)
|-- Features: vector (nullable = true)
A sample of the DataFrame:
+-------+--------------------+
| SKU| Features|
+-------+--------------------+
| 9970.0|[4.7143,0.0,5.785...|
|19676.0|[5.5,0.0,6.4286,4...|
| 3296.0|[4.7143,1.4286,6....|
|13658.0|[6.2857,0.7143,4....|
| 1.0|[4.2308,0.7692,5....|
| 513.0|[3.0,0.0,4.9091,5...|
| 3753.0|[5.9231,0.0,4.846...|
|14967.0|[4.5833,0.8333,5....|
| 2803.0|[4.2308,0.0,4.846...|
|11879.0|[3.1429,0.0,4.5,4...|
+-------+--------------------+
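(For context, here is a minimal, hypothetical sketch of how a DataFrame with this shape could be built; the SparkSession setup and the toy values below are assumptions for illustration, not the asker's actual data:)
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CosineSimilarity").master("local[*]").getOrCreate()
import spark.implicits._

// Toy rows mirroring the (SKU, Features) schema shown above
val inClusters = Seq(
  (9970.0,  Vectors.dense(4.7143, 0.0, 5.785)),
  (19676.0, Vectors.dense(5.5, 0.0, 6.4286)),
  (3296.0,  Vectors.dense(4.7143, 1.4286, 6.0))
).toDF("SKU", "Features")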
I tried to transpose the matrix and checked the links mentioned below: Apache Spark Python Cosine Similarity over DataFrames and calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf. But I believe there is a better solution.
I tried the following sample code:
val irm = new IndexedRowMatrix(inClusters.rdd.map {
case (v,i:Vector) => IndexedRow(v, i)
}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities
But I got the following error:
Error:(80, 12) constructor cannot be instantiated to expected type;
found : (T1, T2)
required: org.apache.spark.sql.Row
case (v,i:Vector) => IndexedRow(v, i)
I checked the following link: Apache Spark: How to create a matrix from a DataFrame? But I can't do it using Scala.
Recommended answer
- DataFrame.rdd returns RDD[Row], not RDD[(T, U)]. You have to pattern match the Row or directly extract the interesting parts.
- The ml Vector used with Datasets since Spark 2.0 is not the same as the mllib Vector used by the old API. You have to convert it to use it with IndexedRowMatrix.
- The index has to be a Long, not a String.
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  // Pattern match on the Row, extract the ml Vector, and convert it to an mllib Vector
  case Row(_, v: org.apache.spark.ml.linalg.Vector) =>
    org.apache.spark.mllib.linalg.Vectors.fromML(v)
}.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })
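With irm built, the similarity computation from the question's own attempt can be chained on top; a short sketch (the variable name simMatrix is mine):
// Transpose so the original DataFrame rows become matrix columns, then
// compute pairwise cosine similarities between those columns.
val simMatrix = irm.toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities()

// Each MatrixEntry(i, j, value) holds the cosine similarity between the
// rows at indices i and j (only the upper triangle, i < j, is stored).
simMatrix.entries.take(10).foreach(println)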