Question
I have a DataFrame in the following format:
item_id1: Long, item_id2: Long, similarity_score: Double
What I'm trying to do is to get the top N highest similarity_score records for each item_id1. So, for example:
1 2 0.5
1 3 0.4
1 4 0.3
2 1 0.5
2 3 0.4
2 4 0.3
The top 2 similar items would give:
1 2 0.5
1 3 0.4
2 1 0.5
2 3 0.4
I vaguely guess that it can be done by first grouping records by item_id1, then sorting in descending order by score, and then limiting the results. But I'm stuck on how to implement this in Spark Scala.
Thanks.
Answer
I would suggest using window functions for this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

df
  // rank within each item_id1 group, highest score first
  .withColumn("rnk", row_number().over(
    Window.partitionBy($"item_id1").orderBy($"similarity_score".desc)))
  .where($"rnk" <= 2)
Alternatively, you could use dense_rank/rank instead of row_number, depending on how you want to handle cases where the similarity score is tied.
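As a minimal end-to-end sketch of the approach above (assuming a local SparkSession; the session setup and the `w`/`rnk` names are illustrative, not part of the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().master("local[*]").appName("topN").getOrCreate()
import spark.implicits._

// The sample data from the question.
val df = Seq(
  (1L, 2L, 0.5), (1L, 3L, 0.4), (1L, 4L, 0.3),
  (2L, 1L, 0.5), (2L, 3L, 0.4), (2L, 4L, 0.3)
).toDF("item_id1", "item_id2", "similarity_score")

val w = Window.partitionBy($"item_id1").orderBy($"similarity_score".desc)

// row_number assigns a unique position even to tied scores, so exactly
// two rows per item_id1 are kept; rank/dense_rank would instead keep
// every row tied for the top two scores.
df.withColumn("rnk", row_number().over(w))
  .where($"rnk" <= 2)
  .drop("rnk")
  .show()
```

With the sample data this prints the four expected rows: (1, 2, 0.5), (1, 3, 0.4), (2, 1, 0.5), (2, 3, 0.4).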