Problem description
I am using Spark Scala to calculate the cosine similarity between DataFrame rows.
The schema of the DataFrame is as follows:
root
|-- SKU: double (nullable = true)
|-- Features: vector (nullable = true)
A sample of the DataFrame:
+-------+--------------------+
| SKU| Features|
+-------+--------------------+
| 9970.0|[4.7143,0.0,5.785...|
|19676.0|[5.5,0.0,6.4286,4...|
| 3296.0|[4.7143,1.4286,6....|
|13658.0|[6.2857,0.7143,4....|
| 1.0|[4.2308,0.7692,5....|
| 513.0|[3.0,0.0,4.9091,5...|
| 3753.0|[5.9231,0.0,4.846...|
|14967.0|[4.5833,0.8333,5....|
| 2803.0|[4.2308,0.0,4.846...|
|11879.0|[3.1429,0.0,4.5,4...|
+-------+--------------------+
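(For context, here is a minimal, hypothetical sketch of how a DataFrame with this shape could be built; the SparkSession setup and the toy values below are assumptions for illustration, not the asker's actual data:)
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CosineSimilarity").master("local[*]").getOrCreate()
import spark.implicits._

// Toy rows mirroring the (SKU, Features) schema shown above
val inClusters = Seq(
  (9970.0,  Vectors.dense(4.7143, 0.0, 5.785)),
  (19676.0, Vectors.dense(5.5, 0.0, 6.4286)),
  (3296.0,  Vectors.dense(4.7143, 1.4286, 6.0))
).toDF("SKU", "Features")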
I tried to transpose the matrix and checked the links mentioned below: Apache Spark Python Cosine Similarity over DataFrames and calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf. But I believe there is a better solution.
I tried the following sample code:
val irm = new IndexedRowMatrix(inClusters.rdd.map {
case (v,i:Vector) => IndexedRow(v, i)
}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities
But I got the following error:
Error:(80, 12) constructor cannot be instantiated to expected type;
found : (T1, T2)
required: org.apache.spark.sql.Row
case (v,i:Vector) => IndexedRow(v, i)
I checked the following link: Apache Spark: How to create a matrix from a DataFrame? But I can't do it using Scala.
Recommended answer
- DataFrame.rdd returns RDD[Row], not RDD[(T, U)]. You have to pattern match the Row or directly extract the interesting parts.
- The ml Vector used with Datasets since Spark 2.0 is not the same as the mllib Vector used by the old API. You have to convert it to use it with IndexedRowMatrix.
- The index has to be a Long, not a String.
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  // Pattern match on the Row, extract the ml Vector, and convert it to an mllib Vector
  case Row(_, v: org.apache.spark.ml.linalg.Vector) =>
    org.apache.spark.mllib.linalg.Vectors.fromML(v)
}.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })
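With irm built, the similarity computation from the question's own attempt can be chained on top; a short sketch (the variable name simMatrix is mine):
// Transpose so the original DataFrame rows become matrix columns, then
// compute pairwise cosine similarities between those columns.
val simMatrix = irm.toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities()

// Each MatrixEntry(i, j, value) holds the cosine similarity between the
// rows at indices i and j (only the upper triangle, i < j, is stored).
simMatrix.entries.take(10).foreach(println)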