Problem Description
I am almost certain this has been asked before, but a search through stackoverflow did not answer my question. Not a duplicate of [2] since I want the maximum value, not the most frequent item. I am new to pyspark and trying to do something really simple: I want to groupBy column "A" and then only keep the row of each group that has the maximum value in column "B". Like this:
df_cleaned = df.groupBy("A").agg(F.max("B"))
Unfortunately, this throws away all other columns: df_cleaned only contains the column "A" and the max value of "B". How do I instead keep the full rows, with columns "A", "B", "C", ...?
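For illustration, a minimal sketch of the behaviour described above, assuming a SparkSession named spark and hypothetical data with an extra column "C":
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('a', 5, 'x'), ('a', 8, 'y'), ('b', 3, 'z')],
    ['A', 'B', 'C']
)
# The aggregation keeps only the grouping key and the aggregate itself.
df.groupBy('A').agg(F.max('B')).show()
#+---+------+
#|  A|max(B)|
#+---+------+
#|  a|     8|
#|  b|     3|
#+---+------+
# Column "C" is gone. (Row order may vary.)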
Recommended Answer
You can do this without a udf, using a Window.
Consider the following example:
import pyspark.sql.functions as f
data = [
('a', 5),
('a', 8),
('a', 7),
('b', 1),
('b', 3)
]
df = sqlCtx.createDataFrame(data, ["A", "B"])
df.show()
#+---+---+
#| A| B|
#+---+---+
#| a| 5|
#| a| 8|
#| a| 7|
#| b| 1|
#| b| 3|
#+---+---+
Create a Window to partition by column A and use this to compute the maximum of each group. Then filter out the rows such that the value in column B is equal to the max.
from pyspark.sql import Window
w = Window.partitionBy('A')
# Compute the per-group maximum of B, keep only the rows that equal it,
# then drop the helper column.
df.withColumn('maxB', f.max('B').over(w))\
    .where(f.col('B') == f.col('maxB'))\
    .drop('maxB')\
    .show()
#+---+---+
#| A| B|
#+---+---+
#| a| 8|
#| b| 3|
#+---+---+
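As a sketch of how this carries over to the question's extra columns (the column "C" and the data below are hypothetical, added here for illustration), the window-based filter keeps every column of the matching rows, and it also keeps all rows tied for the maximum:
# Hypothetical data with an extra column "C"; the last two rows tie on max(B).
data_c = [
    ('a', 5, 'u'),
    ('a', 8, 'v'),
    ('b', 3, 'w'),
    ('b', 3, 'x')
]
df_c = sqlCtx.createDataFrame(data_c, ['A', 'B', 'C'])
df_c.withColumn('maxB', f.max('B').over(w))\
    .where(f.col('B') == f.col('maxB'))\
    .drop('maxB')\
    .show()
#+---+---+---+
#|  A|  B|  C|
#+---+---+---+
#|  a|  8|  v|
#|  b|  3|  w|
#|  b|  3|  x|
#+---+---+---+
# (Row order may differ.)
Because the filter compares each row against its group's maximum rather than ranking rows, ties are all retained; if exactly one row per group is needed, ranking within the window (for example with f.row_number() over a window ordered by B descending) is a common alternative.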
Or equivalently, using pyspark-sql:
df.registerTempTable('table')
q = "SELECT A, B FROM (SELECT *, MAX(B) OVER (PARTITION BY A) AS maxB FROM table) M WHERE B = maxB"
sqlCtx.sql(q).show()
#+---+---+
#| A| B|
#+---+---+
#| b| 3|
#| a| 8|
#+---+---+
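The answer above uses the older sqlCtx / registerTempTable API. On newer Spark versions the same query can be run through a SparkSession; a minimal sketch, assuming a session named spark:
# createOrReplaceTempView / spark.sql are the SparkSession equivalents
# of registerTempTable / sqlCtx.sql.
df.createOrReplaceTempView('table')
spark.sql(
    "SELECT A, B FROM "
    "(SELECT *, MAX(B) OVER (PARTITION BY A) AS maxB FROM table) M "
    "WHERE B = maxB"
).show()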