This article covers how to handle aggregation on multiple columns in PySpark; the question and recommended answer below should be a useful reference for anyone facing the same problem.
Problem description
I have data like below. Filename: babynames.csv.
year name percent sex
1880 John 0.081541 boy
1880 William 0.080511 boy
1880 James 0.050057 boy
I need to sort the input based on year and sex, and I want the output aggregated like below (this output is to be assigned to a new RDD).
year sex avg(percentage) count(rows)
1880 boy 0.070703 3
I am not sure how to proceed after the following step in PySpark. Need your help on this.
testrdd = sc.textFile("babynames.csv")
rows = testrdd.map(lambda y:y.split(',')).filter(lambda x:"year" not in x[0])
aggregatedoutput = ????
Recommended answer
1. Follow the instructions from the README to include the spark-csv package.
2. Load the data:
df = (sqlContext.read
    .format("com.databricks.spark.csv")
    .options(inferSchema="true", delimiter=",", header="true")
    .load("babynames.csv"))
Import the required functions:
from pyspark.sql.functions import count, avg
Group by and aggregate (optionally using Column.alias):
df.groupBy("year", "sex").agg(avg("percent"), count("*"))
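To make the expected numbers concrete, here is a small plain-Python sketch (no Spark needed) of what groupBy("year", "sex").agg(avg("percent"), count("*")) computes on the three sample rows from the question; the variable names are illustrative, not part of any Spark API.

```python
from collections import defaultdict

# Sample rows from the question: (year, name, percent, sex)
rows = [
    ("1880", "John", 0.081541, "boy"),
    ("1880", "William", 0.080511, "boy"),
    ("1880", "James", 0.050057, "boy"),
]

# Group by (year, sex), then compute avg(percent) and count(*),
# mirroring df.groupBy("year", "sex").agg(avg("percent"), count("*"))
groups = defaultdict(list)
for year, _name, percent, sex in rows:
    groups[(year, sex)].append(percent)

result = {
    key: (sum(vals) / len(vals), len(vals))
    for key, vals in groups.items()
}
# For ('1880', 'boy'): avg(percent) ≈ 0.070703, count = 3
```

This matches the expected output table in the question: one row per (year, sex) group with the average percentage and the row count.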
Or alternatively:
- cast percent to numeric
- reshape the data to the format ((year, sex), percent)
- use aggregateByKey with pyspark.statcounter.StatCounter
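The RDD alternative above can be sketched without a Spark cluster. The snippet below uses a minimal stand-in class for pyspark.statcounter.StatCounter (count and mean only, names made up here) and folds values per key the way aggregateByKey's seqOp/combOp would; in real PySpark you would call rdd.aggregateByKey(StatCounter(), StatCounter.merge, StatCounter.mergeStats).

```python
class MiniStatCounter:
    """Illustrative stand-in for pyspark.statcounter.StatCounter."""
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def merge(self, value):        # like seqOp: fold one value into the accumulator
        self.n += 1
        self.total += value
        return self

    def merge_stats(self, other):  # like combOp: combine two partial accumulators
        self.n += other.n
        self.total += other.total
        return self

    def count(self):
        return self.n

    def mean(self):
        return self.total / self.n

lines = [
    "1880,John,0.081541,boy",
    "1880,William,0.080511,boy",
    "1880,James,0.050057,boy",
]

# cast percent to numeric and reshape to ((year, sex), percent)
pairs = []
for line in lines:
    year, _name, percent, sex = line.split(",")
    pairs.append(((year, sex), float(percent)))

# simulate aggregateByKey: one accumulator per key
acc = {}
for key, value in pairs:
    acc.setdefault(key, MiniStatCounter()).merge(value)

summary = {k: (round(c.mean(), 6), c.count()) for k, c in acc.items()}
# summary: {('1880', 'boy'): (0.070703, 3)}
```

The same (mean, count) pair per (year, sex) key comes out as in the DataFrame version; StatCounter additionally tracks min, max, variance, and stdev for free.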
This concludes the article on PySpark aggregation on multiple columns; hopefully the recommended answer above helps.