Question
I have a DataFrame in the following format:
item_id1: Long, item_id2: Long, similarity_score: Double
What I'm trying to do is to get the top N highest similarity_score records for each item_id1. So, for example:
1 2 0.5
1 3 0.4
1 4 0.3
2 1 0.5
2 3 0.4
2 4 0.3
The top 2 similar items would give:
1 2 0.5
1 3 0.4
2 1 0.5
2 3 0.4
I vaguely guess that it can be done by first grouping records by item_id1, then sorting in descending order by score, and then limiting the results. But I'm stuck on how to implement this in Spark Scala.
Thanks.
Answer
I would suggest using window functions for this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

df
  // rank within each item_id1 group, highest score first
  .withColumn("rnk", row_number().over(
    Window.partitionBy($"item_id1").orderBy($"similarity_score".desc)))
  .where($"rnk" <= 2)
Alternatively, you could use dense_rank/rank instead of row_number, depending on how you want to handle cases where the similarity score is tied.
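As a minimal end-to-end sketch of the approach above (assuming a local SparkSession; the session setup and the `w`/`rnk` names are illustrative, not part of the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().master("local[*]").appName("topN").getOrCreate()
import spark.implicits._

// The sample data from the question.
val df = Seq(
  (1L, 2L, 0.5), (1L, 3L, 0.4), (1L, 4L, 0.3),
  (2L, 1L, 0.5), (2L, 3L, 0.4), (2L, 4L, 0.3)
).toDF("item_id1", "item_id2", "similarity_score")

val w = Window.partitionBy($"item_id1").orderBy($"similarity_score".desc)

// row_number assigns a unique position even to tied scores, so exactly
// two rows per item_id1 are kept; rank/dense_rank would instead keep
// every row tied for the top two scores.
df.withColumn("rnk", row_number().over(w))
  .where($"rnk" <= 2)
  .drop("rnk")
  .show()
```

With the sample data this prints the four expected rows: (1, 2, 0.5), (1, 3, 0.4), (2, 1, 0.5), (2, 3, 0.4).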