pyspark - creating a grouped DataFrame column as a map type
Problem description
My DataFrame has the following structure:
--------------------------
| Brand | type | amount |
--------------------------
| B     | a    | 10     |
| B     | b    | 20     |
| C     | c    | 30     |
--------------------------
I want to reduce the number of rows by grouping type and amount into a single column of type Map, so that Brand is unique and MAP_type_AMOUNT holds a key,value pair for each type/amount combination.
I think Spark SQL might have some functions to help with this, or do I have to drop down to the DataFrame's underlying RDD and write my own conversion to a map type?
Expected output:
---------------------------
| Brand | MAP_type_AMOUNT |
---------------------------
| B     | {a: 10, b: 20}  |
| C     | {c: 30}         |
---------------------------
Recommended answer
Slight improvement to Prem's answer (sorry I can't comment yet): use func.create_map instead of func.struct. See the documentation.
import pyspark.sql.functions as func

# Sample data matching the question's DataFrame
df = sc.parallelize([('B', 'a', 10), ('B', 'b', 20),
                     ('C', 'c', 30)]).toDF(['Brand', 'Type', 'Amount'])

# Build a single-entry map per row, then collect those maps per Brand
df_converted = df.groupBy("Brand").agg(
    func.collect_list(func.create_map(func.col("Type"),
                                      func.col("Amount"))).alias("MAP_type_AMOUNT"))
print(df_converted.collect())
Output:
[Row(Brand=u'B', MAP_type_AMOUNT=[{u'a': 10}, {u'b': 20}]),
Row(Brand=u'C', MAP_type_AMOUNT=[{u'c': 30}])]
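Note that this gives each Brand a list of single-entry maps rather than the one merged map shown in the expected output. If a single map per Brand is needed, a minimal sketch on Spark 2.4+ (not part of the original answer) is to collect the Type/Amount pairs as structs and merge them with func.map_from_entries:

import pyspark.sql.functions as func

# Collect (Type, Amount) pairs as structs, then merge them into one
# map per Brand. map_from_entries requires Spark 2.4 or later.
df_single_map = df.groupBy("Brand").agg(
    func.map_from_entries(
        func.collect_list(func.struct(func.col("Type"),
                                      func.col("Amount")))
    ).alias("MAP_type_AMOUNT"))
df_single_map.show(truncate=False)

This should produce rows like (B, {a -> 10, b -> 20}) and (C, {c -> 30}), matching the expected output above.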