How to apply the describe function after grouping a PySpark DataFrame?
Problem description
I want to find the cleanest way to apply the describe function to a grouped DataFrame (this question could also be extended to applying any DataFrame function to a grouped DataFrame).
I tested grouped aggregate pandas UDFs with no luck. It can always be done by passing each statistic inside the agg function, but that does not feel like the proper way.
If we have a sample dataframe:
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))
The idea would be to do something similar to pandas:
df.groupby("id").describe()
The result would be:
        v
    count  mean       std  min   25%  50%   75%   max
id
1     2.0   1.5  0.707107  1.0  1.25  1.5  1.75   2.0
2     3.0   6.0  3.605551  3.0  4.00  5.0  7.50  10.0
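As a cross-check of the figures in that table, here is a minimal sketch that recomputes them with the Python standard library only. It assumes the sample standard deviation (n - 1 denominator) and linearly interpolated quantiles, which are the pandas defaults; `quantile_linear`, `groups`, and `summary` are names made up for this illustration.

```python
import statistics


def quantile_linear(values, q):
    """Quantile using linear interpolation between the two closest ranks."""
    data = sorted(values)
    pos = (len(data) - 1) * q          # fractional position in the sorted data
    lo = int(pos)                      # rank just below (or at) the position
    hi = min(lo + 1, len(data) - 1)    # rank just above, clamped to the end
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


# The sample rows from the question, grouped by "id".
groups = {1: [1.0, 2.0], 2: [3.0, 5.0, 10.0]}

summary = {}
for key, vals in groups.items():
    summary[key] = {
        "count": len(vals),
        "mean": statistics.mean(vals),
        "std": statistics.stdev(vals),  # sample standard deviation
        "min": min(vals),
        "25%": quantile_linear(vals, 0.25),
        "50%": quantile_linear(vals, 0.50),
        "75%": quantile_linear(vals, 0.75),
        "max": max(vals),
    }

for key, stats in summary.items():
    print(key, stats)
```

Any Spark-side solution needs to use the same two conventions to reproduce the pandas numbers.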
Thanks.
Recommended answer
Try this:
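from pyspark.sql import functions as F

# One aggregate per statistic; percentile() returns an array when given an
# array of percentages, so [0] extracts the single value.
df.groupby("id").agg(
    F.count('v').alias('count'),
    F.mean('v').alias('mean'),
    F.stddev('v').alias('std'),
    F.min('v').alias('min'),
    F.expr('percentile(v, array(0.25))')[0].alias('%25'),
    F.expr('percentile(v, array(0.5))')[0].alias('%50'),
    F.expr('percentile(v, array(0.75))')[0].alias('%75'),
    F.max('v').alias('max'),
).show()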
Output:
+---+-----+----+------------------+---+----+---+----+----+
| id|count|mean|               std|min| %25|%50| %75| max|
+---+-----+----+------------------+---+----+---+----+----+
|  1|    2| 1.5|0.7071067811865476|1.0|1.25|1.5|1.75| 2.0|
|  2|    3| 6.0| 3.605551275463989|3.0| 4.0|5.0| 7.5|10.0|
+---+-----+----+------------------+---+----+---+----+----+
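If several columns need the same treatment, the aggregate list can be generated instead of typed out. The following is only a sketch of that idea, not an official PySpark API: `describe_specs` and `grouped_describe` are hypothetical helper names, and the `<column>_<stat>` alias scheme is an assumption chosen for illustration. The pyspark import sits inside the function so that the expression builder can be inspected without a Spark session.

```python
def describe_specs(columns):
    """Return (alias, Spark SQL expression) pairs for describe-like statistics.

    Hypothetical helper: the alias scheme "<column>_<stat>" is an assumption.
    """
    specs = []
    for c in columns:
        col = f"`{c}`"  # backticks guard column names with special characters
        specs += [
            (f"{c}_count", f"count({col})"),
            (f"{c}_mean", f"avg({col})"),
            (f"{c}_std", f"stddev({col})"),
            (f"{c}_min", f"min({col})"),
            (f"{c}_p25", f"percentile({col}, 0.25)"),
            (f"{c}_p50", f"percentile({col}, 0.5)"),
            (f"{c}_p75", f"percentile({col}, 0.75)"),
            (f"{c}_max", f"max({col})"),
        ]
    return specs


def grouped_describe(df, keys, columns):
    """Group df by keys and compute the statistics above for each column."""
    from pyspark.sql import functions as F  # imported lazily, needs pyspark

    aggs = [F.expr(sql).alias(alias) for alias, sql in describe_specs(columns)]
    return df.groupBy(*keys).agg(*aggs)


# Usage with the sample dataframe from the question (requires a Spark session):
# grouped_describe(df, ["id"], ["v"]).show()

for alias, sql in describe_specs(["v"]):
    print(alias, "<-", sql)
```

Note that `percentile` computes exact values, which can be costly on large groups; Spark SQL also provides `percentile_approx`, which could be swapped into the expressions when an approximation is acceptable.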