Apache Spark: grouping by a combined type and subtype column
Problem description
I have this dataset:
val sales = Seq(
  ("Warsaw", 2016, "facebook", "share", 100),
  ("Warsaw", 2017, "facebook", "like", 200),
  ("Boston", 2015, "twitter", "share", 50),
  ("Boston", 2016, "facebook", "share", 150),
  ("Toronto", 2017, "twitter", "like", 50)
).toDF("city", "year", "media", "action", "amount")
I can group this by city and media like this:
// Count rows per (city, media) pair
val groupByCityAndMedia = sales
  .groupBy("city", "media")
  .count()
groupByCityAndMedia.show()
+-------+--------+-----+
|   city|   media|count|
+-------+--------+-----+
| Boston|facebook|    1|
| Boston| twitter|    1|
|Toronto| twitter|    1|
| Warsaw|facebook|    2|
+-------+--------+-----+
But how can I combine media and action into a single column, so that the expected output is:
+-------+--------+-----+
| Boston|facebook|    1|
| Boston|   share|    2|
| Boston| twitter|    1|
|Toronto| twitter|    1|
|Toronto|    like|    1|
| Warsaw|facebook|    2|
| Warsaw|   share|    1|
| Warsaw|    like|    1|
+-------+--------+-----+
Recommended answer
Combine the media and action columns into a single array column, explode it so that each array element becomes its own row, then do a groupBy and count:
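// array and explode live in org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{array, explode}

sales.select(
  $"city", explode(array($"media", $"action")).as("mediaAction")
).groupBy("city", "mediaAction").count().show()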
+-------+-----------+-----+
|   city|mediaAction|count|
+-------+-----------+-----+
| Boston|      share|    2|
| Boston|   facebook|    1|
| Warsaw|      share|    1|
| Boston|    twitter|    1|
| Warsaw|       like|    1|
|Toronto|    twitter|    1|
|Toronto|       like|    1|
| Warsaw|   facebook|    2|
+-------+-----------+-----+
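For running the explode approach outside spark-shell, here is a minimal self-contained sketch. The local SparkSession settings, the app name, the value name counts, and the final orderBy are assumptions added for illustration (show() does not guarantee row order by itself); the core select/explode/groupBy chain is the one from the answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode}

// Assumption: a local session for experimentation; in spark-shell getOrCreate reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("explode-counts").getOrCreate()
import spark.implicits._

val sales = Seq(
  ("Warsaw", 2016, "facebook", "share", 100),
  ("Warsaw", 2017, "facebook", "like", 200),
  ("Boston", 2015, "twitter", "share", 50),
  ("Boston", 2016, "facebook", "share", 150),
  ("Toronto", 2017, "twitter", "like", 50)
).toDF("city", "year", "media", "action", "amount")

// Each input row becomes two rows: one carrying its media value, one carrying its action value.
val counts = sales
  .select(col("city"), explode(array(col("media"), col("action"))).as("mediaAction"))
  .groupBy("city", "mediaAction")
  .count()
  .orderBy("city", "mediaAction") // added only to make the printed order deterministic

counts.show()
```

Sorting is optional; it is included here so that repeated runs print the rows in the same order.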
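Or, assuming media and action don't intersect (the two columns have no values in common), union the two separate counts. If a value did occur in both columns, it would show up as two separate rows for the same city instead of one combined count. Note that union matches columns by position, so the result keeps the column name media: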
sales.groupBy("city", "media").count().union(
  sales.groupBy("city", "action").count()
).show
+-------+--------+-----+
|   city|   media|count|
+-------+--------+-----+
| Boston|facebook|    1|
| Boston| twitter|    1|
|Toronto| twitter|    1|
| Warsaw|facebook|    2|
| Boston|   share|    2|
| Warsaw|   share|    1|
| Warsaw|    like|    1|
|Toronto|    like|    1|
+-------+--------+-----+
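If the two columns might share values, one possible way to keep the union approach is to re-aggregate after the union and sum the partial counts. This is a sketch rather than part of the original answer: the names perMedia, perAction and merged, the renaming to mediaAction, and the local session settings are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Assumption: a local session for experimentation; in spark-shell getOrCreate reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("union-counts").getOrCreate()
import spark.implicits._

val sales = Seq(
  ("Warsaw", 2016, "facebook", "share", 100),
  ("Warsaw", 2017, "facebook", "like", 200),
  ("Boston", 2015, "twitter", "share", 50),
  ("Boston", 2016, "facebook", "share", 150),
  ("Toronto", 2017, "twitter", "like", 50)
).toDF("city", "year", "media", "action", "amount")

// Give both partial results the same column name so the unioned schema is unambiguous.
val perMedia = sales.groupBy("city", "media").count()
  .withColumnRenamed("media", "mediaAction")
val perAction = sales.groupBy("city", "action").count()
  .withColumnRenamed("action", "mediaAction")

// union is positional; summing afterwards merges rows whose value occurs in both columns.
val merged = perMedia.union(perAction)
  .groupBy("city", "mediaAction")
  .agg(sum("count").as("count"))
  .orderBy("city", "mediaAction")

merged.show()
```

With the sample data the columns do not overlap, so the extra aggregation changes nothing; it only matters when the same value can appear in both media and action.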