Problem Description
I have a DataFrame in the following format:
item_id1: Long, item_id2: Long, similarity_score: Double
What I'm trying to do is get the top N highest similarity_score records for each item_id1. So, for example, given:
1 2 0.5
1 3 0.4
1 4 0.3
2 1 0.5
2 3 0.4
2 4 0.3
taking the top 2 similar items would give:
1 2 0.5
1 3 0.4
2 1 0.5
2 3 0.4
I vaguely suspect this can be done by first grouping the records by item_id1, then sorting each group by score in descending order, and then limiting the results. But I'm stuck on how to implement it in Spark Scala.
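For intuition, that group-sort-limit idea can be sketched on plain Scala collections, with no Spark involved (`Sim` and `topN` are hypothetical names used only for this illustration):

```scala
// Top-N per group on in-memory data, mirroring the intended Spark logic:
// group by item_id1, sort each group by score descending, keep the first N.
case class Sim(itemId1: Long, itemId2: Long, score: Double)

def topN(records: Seq[Sim], n: Int): Seq[Sim] =
  records
    .groupBy(_.itemId1)              // group by item_id1
    .toSeq
    .sortBy(_._1)                    // stable output order by key
    .flatMap { case (_, group) =>
      group.sortBy(-_.score).take(n) // highest scores first, keep N
    }

val data = Seq(
  Sim(1, 2, 0.5), Sim(1, 3, 0.4), Sim(1, 4, 0.3),
  Sim(2, 1, 0.5), Sim(2, 3, 0.4), Sim(2, 4, 0.3)
)

val top2 = topN(data, 2)
// top2: Sim(1,2,0.5), Sim(1,3,0.4), Sim(2,1,0.5), Sim(2,3,0.4)
```

On a real DataFrame this groupBy-and-sort approach would shuffle whole groups, which is why the answer below uses a window function instead.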
Thanks.
Recommended Answer
I would suggest using window functions for this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

df
  // number rows within each item_id1 partition, highest score first
  .withColumn("rnk", row_number().over(
    Window.partitionBy($"item_id1").orderBy($"similarity_score".desc)))
  // keep the top 2 per item_id1
  .where($"rnk" <= 2)
Alternatively, you could use dense_rank or rank instead of row_number, depending on how you want to handle tied similarity scores: row_number breaks ties arbitrarily and returns exactly N rows per group, while rank and dense_rank assign tied rows the same rank, so a tie at the cutoff can return more than N rows.
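The difference only shows up on ties. A plain-Scala sketch (hypothetical helper logic, not Spark API) of how the three functions would number a group whose scores, already sorted descending, are 0.5, 0.5, 0.4:

```scala
// Simulate row_number, rank, and dense_rank on one tied, descending-sorted group.
val scores = Seq(0.5, 0.5, 0.4)

// row_number: consecutive, ties broken arbitrarily
val rowNumber = scores.indices.map(_ + 1).toSeq        // 1, 2, 3

// rank: ties share a rank, next rank skips ahead
val rank = scores.map(s => scores.indexWhere(_ == s) + 1) // 1, 1, 3

// dense_rank: ties share a rank, no gaps
val distinct  = scores.distinct
val denseRank = scores.map(s => distinct.indexOf(s) + 1)  // 1, 1, 2
```

So with a filter of `<= 2`, row_number keeps exactly two rows, while rank and dense_rank would keep both tied rows (and dense_rank would also admit the 0.4 row here).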