This post covers how to group by a key and collect the values of each group into a list using a PySpark SQL DataFrame.

Problem description

So I have a Spark DataFrame that looks like:

a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7

And I want to group by column a, create a list of the values from column b for each group, and forget about c. The output DataFrame would be:

a | b_list
5 | (2,4)
2 | (4,3)

How would I go about doing this with a PySpark SQL DataFrame?

Thank you! :)

Solution

Here are the steps to get that DataFrame.

>>> from pyspark.sql import functions as F
>>>
>>> d = [{'a': 5, 'b': 2, 'c':1}, {'a': 5, 'b': 4, 'c':3}, {'a': 2, 'b': 4, 'c':2}, {'a': 2, 'b': 3,'c':7}]
>>> df = spark.createDataFrame(d)
>>> df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  5|  2|  1|
|  5|  4|  3|
|  2|  4|  2|
|  2|  3|  7|
+---+---+---+

>>> df1 = df.groupBy('a').agg(F.collect_list("b"))
>>> df1.show()
+---+---------------+
|  a|collect_list(b)|
+---+---------------+
|  5|         [2, 4]|
|  2|         [4, 3]|
+---+---------------+

07-24 13:52