

I am using Spark with Scala to calculate the cosine similarity between the rows of a DataFrame.


    |-- SKU: double (nullable = true)
    |-- Features: vector (nullable = true)


    |    SKU|            Features|
    | 9970.0|[4.7143,0.0,5.785...|
    | 3296.0|[4.7143,1.4286,6....|
    |    1.0|[4.2308,0.7692,5....|
    |  513.0|[3.0,0.0,4.9091,5...|
    | 3753.0|[5.9231,0.0,4.846...|
    | 2803.0|[4.2308,0.0,4.846...|
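
For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their Euclidean norms. A minimal plain-Scala sketch of that definition follows (no Spark involved). The helper name `cosine` and the sample values are illustrative assumptions, not part of the original post, and returning 0.0 for a zero vector is just one possible convention.

```scala
// Cosine similarity of two equal-length dense vectors: dot(a, b) / (|a| * |b|).
// `cosine` is a hypothetical helper name used only for illustration.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  // Assumed convention: similarity with a zero vector is reported as 0.0.
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Illustrative values only, not taken from the DataFrame above.
val u = Array(1.0, 0.0)
val v = Array(1.0, 1.0)
println(cosine(u, v))
```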

I tried to transpose the matrix and looked at the following links: "Apache Spark Python Cosine Similarity over DataFrames" and "calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf". But I believe there is a better solution.


val irm = new IndexedRowMatrix(inClusters.rdd.map {
  case (v,i:Vector) => IndexedRow(v, i)
})



Error:(80, 12) constructor cannot be instantiated to expected type;
 found   : (T1, T2)
 required: org.apache.spark.sql.Row
      case (v,i:Vector) => IndexedRow(v, i)

I checked the following link, "Apache Spark: How to create a matrix from a DataFrame?", but I couldn't get it to work in Scala.


  • DataFrame.rdd returns RDD[Row], not RDD[(T, U)]. You have to pattern match on the Row or extract the parts you need directly.
  • The ml Vector used with Datasets since Spark 2.0 is not the same as the mllib Vector used by the old API. You have to convert it before using it with IndexedRowMatrix.
  • The index has to be a Long, not a string.
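
    import org.apache.spark.sql.Row
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
    val irm = new IndexedRowMatrix(inClusters.rdd.map {
      // Match on Row and convert the ml Vector to an mllib Vector.
      case Row(_, v: org.apache.spark.ml.linalg.Vector) =>
        org.apache.spark.mllib.linalg.Vectors.fromML(v)
    }.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })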
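
The three points above, followed by the transpose step mentioned in the question, can be sketched end to end as below. This is only a sketch under stated assumptions: the helper name `rowCosineSimilarities` is hypothetical, the DataFrame is assumed to have the `SKU` and `Features` columns shown in the question, and Spark 2.0 or later is assumed for `Vectors.fromML`. Because `RowMatrix.columnSimilarities` compares columns, the matrix is transposed first so that the original rows become columns.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix}
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper: pairwise cosine similarity between the rows of a
// DataFrame shaped like the one in the question (SKU: double, Features: vector).
def rowCosineSimilarities(df: DataFrame): CoordinateMatrix = {
  val indexedRows = df.select("SKU", "Features").rdd
    .map { case Row(_, v: MLVector) => OldVectors.fromML(v) } // ml -> mllib vector
    .zipWithIndex()                                            // attach a Long index
    .map { case (v, i) => IndexedRow(i, v) }

  new IndexedRowMatrix(indexedRows)
    .toCoordinateMatrix()  // (row, column, value) entries
    .transpose()           // original rows become columns
    .toRowMatrix()
    .columnSimilarities()  // cosine similarity between columns
}
```

The result is an upper-triangular CoordinateMatrix: each entry (i, j, value) with i < j holds the similarity between rows i and j. Note that i and j are the positions assigned by `zipWithIndex`, not SKU values, so mapping them back to SKUs would need the index kept alongside the SKU column.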

