本文介绍了pyspark:汇总列中最频繁的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
aggregrated_table = df_input.groupBy('city', 'income_bracket') \
.agg(
count('suburb').alias('suburb'),
sum('population').alias('population'),
sum('gross_income').alias('gross_income'),
sum('no_households').alias('no_households'))
想按城市和收入等级分组,但要在每个城市内郊区有不同的收入等级。如何按每个城市最常见的收入等级分组?
Would like to group by city and income bracket but within each city certain suburbs have different income brackets. How do I group by the most frequently occurring income bracket per city?
例如:
city1 suburb1 income_bracket_10
city1 suburb1 income_bracket_10
city1 suburb2 income_bracket_10
city1 suburb3 income_bracket_11
city1 suburb4 income_bracket_10
将按收入_支架_10分组
Would be grouped by income_bracket_10
推荐答案
使用聚合前使用窗口函数可能会达到目的:
Using a window function before aggregating might do the trick:
from pyspark.sql import Window
import pyspark.sql.functions as psf
w = Window.partitionBy('city')
aggregrated_table = df_input.withColumn(
"count",
psf.count("*").over(w)
).withColumn(
"rn",
psf.row_number().over(w.orderBy(psf.desc("count")))
).filter("rn = 1").groupBy('city', 'income_bracket').agg(
psf.count('suburb').alias('suburb'),
psf.sum('population').alias('population'),
psf.sum('gross_income').alias('gross_income'),
psf.sum('no_households').alias('no_households'))
您还可以在汇总后使用窗口函数,因为您要保留(城市,收入居中)发生次数。
you can also use a window function after aggregating since you're keeping a count of (city, income_bracket) occurrences.
这篇关于pyspark:汇总列中最频繁的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!