Problem description
I have a DataFrame that looks like this:
userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2
The number of distinct categories is 10, and I would like to create a feature vector for each userID and fill the missing categories with zeros.
So the output would be something like:
userID,feature
1,[1,3,0,0,0,0,0,0,5,0]
2,[0,0,0,6,0,0,0,0,2,1]
3,[5,0,0,0,0,0,16,2,0,0]
This is just an illustrative example; in reality I have about 200,000 unique userIDs and 300 unique categories.
What is the most efficient way to create the features DataFrame?
Recommended answer
Assuming:
val cs: SparkContext
val sc: SQLContext
val cats: DataFrame
where userId and frequency are bigint columns corresponding to scala.Long.
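A minimal sketch of this assumed setup, using the Spark 1.x API and the sample rows from the question (the local-mode configuration and app name are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical local setup; the names cs, sc, and cats match the answer's assumptions.
val cs: SparkContext = new SparkContext(new SparkConf().setAppName("features").setMaster("local[*]"))
val sc: SQLContext = new SQLContext(cs)
import sc.implicits._

// The sample data from the question, typed as Long to match the bigint columns.
val cats = cs.parallelize(Seq(
  (1L, "cat1", 1L), (1L, "cat2", 3L), (1L, "cat9", 5L),
  (2L, "cat4", 6L), (2L, "cat9", 2L), (2L, "cat10", 1L),
  (3L, "cat1", 5L), (3L, "cat7", 16L), (3L, "cat8", 2L)
)).toDF("userId", "category", "frequency")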
We start by creating an intermediate mapping RDD:
// Group the rows by user and build a category -> frequency map per user.
val catMaps = cats.rdd
  .groupBy(_.getAs[Long]("userId"))
  .map { case (id, rows) =>
    id -> rows
      .map { row => row.getAs[String]("category") -> row.getAs[Long]("frequency") }
      .toMap
  }
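For the sample data, each entry pairs a userId with that user's map; the first collected entry would look something like this:

catMaps.collect().foreach(println)
// e.g. (1,Map(cat1 -> 1, cat2 -> 3, cat9 -> 5))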
Then we collect all the categories that appear, in lexicographic order (note that "cat10" sorts before "cat2" lexicographically, which is why the feature positions in the output below differ from the question's expected output):
val catNames = cs.broadcast(catMaps.map(_._2.keySet).reduce(_ union _).toArray.sorted)
or create it manually:
val catNames = cs.broadcast((1 to 10).map(n => s"cat$n").toArray)
Finally, we transform the maps into arrays, with 0 as the value for categories that are not present:
import sc.implicits._

// Turn each user's map into a dense array aligned with catNames,
// defaulting to 0 for categories the user never touched.
val catArrays = catMaps
  .map { case (id, catMap) => id -> catNames.value.map(catMap.getOrElse(_, 0L)) }
  .toDF("userId", "feature")
Now catArrays.show() prints something like:
+------+--------------------+
|userId| feature|
+------+--------------------+
| 2|[0, 1, 0, 6, 0, 0...|
| 1|[1, 0, 3, 0, 0, 0...|
| 3|[5, 0, 0, 0, 16, ...|
+------+--------------------+
This may not be the most elegant solution for DataFrames, as I'm barely familiar with this area of Spark.
Note that you could create catNames manually to add zeros for the missing cat3, cat5, ...
Also note that otherwise the catMaps RDD is computed twice, so you might want to .persist() it.
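A minimal sketch of that suggestion, using RDD.persist() with the default storage level; it caches catMaps so that the catNames reduction and the final transformation do not both recompute the grouping:

// Cache catMaps before it is consumed twice (by catNames and catArrays).
catMaps.persist()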