Getting column names after columnSimilarities() in Spark Scala

Question

I'm trying to build an item-based collaborative filtering model with columnSimilarities() in Spark. After calling columnSimilarities(), I want to assign the original column names back to the results in Spark Scala.

Runnable code to compute columnSimilarities() on a data frame.

Data

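import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// RDD of rows, one Double per item column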
val rowsRdd: RDD[Row] = sc.parallelize(
  Seq(
    Row(2.0, 7.0, 1.0),
    Row(3.5, 2.5, 0.0),
    Row(7.0, 5.9, 0.0)
  )
)

// Schema
val schema = new StructType()
  .add(StructField("item_1", DoubleType, true))
  .add(StructField("item_2", DoubleType, true))
  .add(StructField("item_3", DoubleType, true))

// Data frame
val df = spark.createDataFrame(rowsRdd, schema)

Compute columnSimilarities() on that data frame:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}

val rows = new VectorAssembler().setInputCols(df.columns).setOutputCol("vs")
  .transform(df)
  .select("vs")
  .rdd

val items_mllib_vector = rows.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
                             .map(org.apache.spark.mllib.linalg.Vectors.fromML)
val mat = new RowMatrix(items_mllib_vector)
val simsPerfect = mat.columnSimilarities()


simsPerfect.entries.collect.mkString(", ")

Output:

res0: String = MatrixEntry(0,2,0.24759378423606918), MatrixEntry(1,2,0.7376189553526812), MatrixEntry(0,1,0.8355316482961213)
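
Each MatrixEntry(i, j, value) holds two column positions and the cosine similarity between those columns. As a rough cross-check of what the positions refer to, the same values can be recomputed in plain Scala without Spark. This is only a sketch: it assumes position i corresponds to the i-th column passed to the VectorAssembler, and the names data, columns, cosine and pairs are illustrative, not part of any Spark API.

```scala
// Plain-Scala cross-check (no Spark): cosine similarity between columns.
// The values are copied from the rows defined above.
val data = Seq(
  Seq(2.0, 7.0, 1.0),
  Seq(3.5, 2.5, 0.0),
  Seq(7.0, 5.9, 0.0)
)

// Transpose rows into columns: columns(0) holds the item_1 values, and so on.
val columns = data.transpose

// Cosine similarity: dot product divided by the product of the two norms.
def cosine(a: Seq[Double], b: Seq[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

// Pairs with i < j only, the same shape as the entries printed above.
val pairs = for {
  i <- columns.indices
  j <- columns.indices if i < j
} yield (i, j, cosine(columns(i), columns(j)))

pairs.foreach(println)
```

The printed triples should agree with the MatrixEntry values above up to floating-point rounding, which suggests that position 0 is item_1, position 1 is item_2 and position 2 is item_3.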

I need to get the original column names instead of the positions in that vector.

I tried to read the column names from df with:

val names = df.columns

My idea was to match the names with the positions in that vector, which should be in the same order, but I don't know how to attach the names back to the entries that hold the cosine similarities.

I'd be grateful for any advice!

Recommended answer

Extract the column names first (this is the tricky part here, because df.columns cannot be evaluated inside the closure):

val names = df.columns

Map the entries:

simsPerfect.entries.map {
  case MatrixEntry(i, j, v) => (names(i.toInt), names(j.toInt), v)
}
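
The mapped result is an RDD of (name, name, similarity) tuples. If a data frame is more convenient, one possible follow-up is sketched below. It assumes a spark-shell session in which spark, df and simsPerfect are defined as in the question; the column names item_a, item_b and similarity are hypothetical labels chosen for illustration.

```scala
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import spark.implicits._ // assumes `spark` is the active SparkSession (as in spark-shell)

// Read the names on the driver, outside the closure, so only a plain
// Array[String] is captured and shipped to the executors.
val names = df.columns

// Replace the column positions with names and convert to a data frame.
// "item_a", "item_b" and "similarity" are illustrative column labels.
val named = simsPerfect.entries.map {
  case MatrixEntry(i, j, v) => (names(i.toInt), names(j.toInt), v)
}.toDF("item_a", "item_b", "similarity")

// Most similar pairs first.
named.orderBy($"similarity".desc).show()
```

columnSimilarities() returns an upper-triangular matrix, so each pair of columns appears once, with item_a preceding item_b in column order.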

