Problem description
I have a DataFrame that looks like this:
userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2
The number of distinct categories is 10, and I would like to create a feature vector for each userID and fill the missing categories with zeros.
So the output would be something like:
userID,feature
1,[1,3,0,0,0,0,0,0,5,0]
2,[0,0,0,6,0,0,0,0,2,1]
3,[5,0,0,0,0,0,16,2,0,0]
This is just an illustrative example; in reality I have about 200,000 unique userIDs and 300 unique categories.
What is the most efficient way to create the features DataFrame?
Recommended answer
Assuming:
val cs: SparkContext
val sc: SQLContext
val cats: DataFrame
where userId and frequency are bigint columns corresponding to scala.Long.
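A minimal sketch of this assumed setup, using the Spark 1.x API and the sample rows from the question (the local-mode configuration and app name are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical local setup; the names cs, sc, and cats match the answer's assumptions.
val cs: SparkContext = new SparkContext(new SparkConf().setAppName("features").setMaster("local[*]"))
val sc: SQLContext = new SQLContext(cs)
import sc.implicits._

// The sample data from the question, typed as Long to match the bigint columns.
val cats = cs.parallelize(Seq(
  (1L, "cat1", 1L), (1L, "cat2", 3L), (1L, "cat9", 5L),
  (2L, "cat4", 6L), (2L, "cat9", 2L), (2L, "cat10", 1L),
  (3L, "cat1", 5L), (3L, "cat7", 16L), (3L, "cat8", 2L)
)).toDF("userId", "category", "frequency")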
We start by creating an intermediate mapping RDD:
// Group the rows by user and build a category -> frequency map per user.
val catMaps = cats.rdd
  .groupBy(_.getAs[Long]("userId"))
  .map { case (id, rows) =>
    id -> rows
      .map { row => row.getAs[String]("category") -> row.getAs[Long]("frequency") }
      .toMap
  }
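For the sample data, each entry pairs a userId with that user's map; the first collected entry would look something like this:

catMaps.collect().foreach(println)
// e.g. (1,Map(cat1 -> 1, cat2 -> 3, cat9 -> 5))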
Then we collect all the categories that appear, in lexicographic order (note that "cat10" sorts before "cat2" lexicographically, which is why the feature positions in the output below differ from the question's expected output):
val catNames = cs.broadcast(catMaps.map(_._2.keySet).reduce(_ union _).toArray.sorted)
or create it manually:
val catNames = cs.broadcast((1 to 10).map(n => s"cat$n").toArray)
Finally, we transform the maps into arrays, with 0 as the value for categories that are not present:
import sc.implicits._

// Turn each user's map into a dense array aligned with catNames,
// defaulting to 0 for categories the user never touched.
val catArrays = catMaps
  .map { case (id, catMap) => id -> catNames.value.map(catMap.getOrElse(_, 0L)) }
  .toDF("userId", "feature")
Now catArrays.show() prints something like:
+------+--------------------+
|userId| feature|
+------+--------------------+
| 2|[0, 1, 0, 6, 0, 0...|
| 1|[1, 0, 3, 0, 0, 0...|
| 3|[5, 0, 0, 0, 16, ...|
+------+--------------------+
This may not be the most elegant solution for DataFrames, as I'm barely familiar with this area of Spark.
Note that you could create catNames manually to add zeros for the missing cat3, cat5, ...
Also note that otherwise the catMaps RDD is computed twice, so you might want to .persist() it.
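A minimal sketch of that suggestion, using RDD.persist() with the default storage level; it caches catMaps so that the catNames reduction and the final transformation do not both recompute the grouping:

// Cache catMaps before it is consumed twice (by catNames and catArrays).
catMaps.persist()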