Why does agg() in PySpark aggregate only one column at a time?

Problem description

For the following DataFrame:

df=spark.createDataFrame(data=[('Alice',4.300),('Bob',7.677)],schema=['name','High'])

When I try to find both min and max, I only get the min value in the output.

df.agg({'High':'max','High':'min'}).show()
+-----------+
|min(High)  |
+-----------+
|    2094900|
+-----------+

Why can't agg() return both max and min, as in Pandas?

Recommended answer

As the documentation for agg() explains:

Compute aggregates and returns the result as a DataFrame.

The available aggregate functions are avg, max, min, sum, count.

If exprs is a single dict mapping from string to string, then the key is the column to perform aggregation on, and the value is the aggregate function.

Alternatively, exprs can also be a list of aggregate Column expressions.

Parameters: exprs – a dict mapping from column name (string) to aggregate functions (string), or a list of Column.
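One way to see what agg() actually receives in the question is to evaluate the dict literal on its own. The following is a minimal sketch in plain Python, with no Spark session involved; the variable name `exprs` is only illustrative.

```python
# A dict literal with a repeated key keeps only the last value for that key,
# so the 'max' entry is overwritten before agg() is ever called.
exprs = {'High': 'max', 'High': 'min'}

print(exprs)       # {'High': 'min'}
print(len(exprs))  # 1
```

This is ordinary Python dict behaviour rather than anything Spark does, which is why only min(High) appears in the output.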

Because a dict can hold only one entry per key, the dict form allows just one aggregate function per column. To apply several functions to the same column, pass a list of Column expressions instead, like this:

>>> from pyspark.sql import functions as F

>>> df.agg(F.min(df.High),F.max(df.High),F.avg(df.High),F.sum(df.High)).show()
+---------+---------+---------+---------+
|min(High)|max(High)|avg(High)|sum(High)|
+---------+---------+---------+---------+
|      4.3|    7.677|   5.9885|   11.977|
+---------+---------+---------+---------+
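If a Pandas-style specification with several functions per column is more convenient, it can be expanded into individual expressions first. The sketch below is only an illustration: `build_agg_exprs` is a hypothetical helper, not part of PySpark, and it assumes the resulting SQL strings are later turned into columns with `F.expr`.

```python
def build_agg_exprs(spec):
    """Expand {column: [function, ...]} into SQL strings such as 'min(`High`)'."""
    # Function names listed in the agg() documentation quoted above.
    allowed = {"avg", "max", "min", "sum", "count"}
    exprs = []
    for column, functions in spec.items():
        for func in functions:
            if func not in allowed:
                raise ValueError(f"unsupported aggregate function: {func}")
            # Backticks quote the column name inside the SQL expression.
            exprs.append(f"{func}(`{column}`)")
    return exprs


exprs = build_agg_exprs({"High": ["min", "max"]})
print(exprs)  # ['min(`High`)', 'max(`High`)']

# With a live SparkSession and the df from the question, the strings could be
# passed to agg() like this (not executed here):
#   from pyspark.sql import functions as F
#   df.agg(*[F.expr(e) for e in exprs]).show()
```

Using a list of functions as the dict value avoids the duplicate-key problem, since each column appears as a key only once.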
