This article describes how to compute cosine similarity between the rows of a Spark DataFrame.
Problem Description
I am using Spark Scala to calculate cosine similarity between DataFrame rows.
The DataFrame schema is as follows:
root
|-- SKU: double (nullable = true)
|-- Features: vector (nullable = true)
A sample of the DataFrame:
+-------+--------------------+
| SKU| Features|
+-------+--------------------+
| 9970.0|[4.7143,0.0,5.785...|
|19676.0|[5.5,0.0,6.4286,4...|
| 3296.0|[4.7143,1.4286,6....|
|13658.0|[6.2857,0.7143,4....|
| 1.0|[4.2308,0.7692,5....|
| 513.0|[3.0,0.0,4.9091,5...|
| 3753.0|[5.9231,0.0,4.846...|
|14967.0|[4.5833,0.8333,5....|
| 2803.0|[4.2308,0.0,4.846...|
|11879.0|[3.1429,0.0,4.5,4...|
+-------+--------------------+
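For reference, a DataFrame with this schema can be built as in the following minimal sketch. The feature values here are made up to match the truncated sample above, and inClusters is the variable name the code later in this post uses:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("CosineSimilarity").getOrCreate()
import spark.implicits._

// Made-up feature values, matching the (SKU, Features) schema shown above.
val inClusters = Seq(
  (9970.0,  Vectors.dense(4.7143, 0.0, 5.785, 5.0)),
  (19676.0, Vectors.dense(5.5, 0.0, 6.4286, 4.0)),
  (3296.0,  Vectors.dense(4.7143, 1.4286, 6.0, 5.0))
).toDF("SKU", "Features")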
I tried to transpose the matrix and checked the following links: Apache Spark Python Cosine Similarity over DataFrames and calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf, but I believe there is a better solution.
I tried the following sample code:
val irm = new IndexedRowMatrix(inClusters.rdd.map {
case (v,i:Vector) => IndexedRow(v, i)
}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities
But I got the following error:
Error:(80, 12) constructor cannot be instantiated to expected type;
found : (T1, T2)
required: org.apache.spark.sql.Row
case (v,i:Vector) => IndexedRow(v, i)
I checked the following link: Apache Spark: How to create a matrix from a DataFrame? But I can't do it using Scala.
Recommended Answer
- DataFrame.rdd returns RDD[Row], not RDD[(T, U)]. You have to pattern match the Row or directly extract the interesting parts.
- The ml Vector used with Datasets since Spark 2.0 is not the same as the mllib Vector used by the old API. You have to convert it to use it with IndexedRowMatrix.
- The index has to be a Long, not a string.
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  // Pattern match the Row and convert the ml Vector to its mllib counterpart.
  case Row(_, v: org.apache.spark.ml.linalg.Vector) =>
    org.apache.spark.mllib.linalg.Vectors.fromML(v)
}.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })
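From here, the row-to-row similarities can be obtained exactly as the question attempted: transpose the matrix so that each original row becomes a column, then call columnSimilarities. A short sketch (the variable name similarities is arbitrary):

// Transpose so each original row becomes a column, then compute pairwise
// cosine similarities between columns (i.e. between the original rows).
val similarities = irm.toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities()

// Each MatrixEntry(i, j, value) holds the cosine similarity between rows i and j.
similarities.entries.take(10).foreach(println)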