This article describes how to pivot one column in Spark while keeping the rest of the dataframe intact; it may serve as a useful reference for anyone facing the same problem.
Problem description
Given the following dataframe, how do I pivot the max scores but aggregate the sum of plays?
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
from pyspark.sql import Window

# In a standalone script the contexts have to be created first;
# in a PySpark shell `sc` and `sqlContext` are already available.
sc = SparkContext()
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame([
    ("u1", "g1", 10, 0, 1),
    ("u1", "g3", 2, 2, 1),
    ("u1", "g3", 5, 3, 1),
    ("u1", "g4", 5, 4, 1),
    ("u2", "g2", 1, 1, 1),
], ["UserID", "GameID", "Score", "Time", "Plays"])
Desired output
+------+-------------+-------------+-----+
|UserID|MaxScoreGame1|MaxScoreGame2|Plays|
+------+-------------+-------------+-----+
| u1| 10| 5| 4|
| u2| 1| null| 1|
+------+-------------+-------------+-----+
I posted a solution below, but I'm hoping to avoid using a join.
Recommended answer
I don't think it is a real improvement, but you can add the total number of plays
...
.select(
    F.col("*"),
    F.row_number().over(rowNumberWindow).alias("GameNumber"),
    F.sum("Plays").over(rowNumberWindow.orderBy()).alias("total_plays")
)
...
and use it later as a secondary grouping column for pivot:
...
.groupBy("UserID", "total_plays")
.pivot("GameCol", ["MaxScoreGame1", "MaxScoreGame2"])
.agg(F.max("Score"))
...
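The two fragments above reference a rowNumberWindow and a GameCol column that come from the solution posted with the original question and are not reproduced here. What follows is a minimal end-to-end sketch under two assumptions reconstructed from the desired output: the window ranks each user's games by Time, and GameCol is the string "MaxScoreGame" followed by that rank.

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
from pyspark.sql import Window

sc = SparkContext()
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame([
    ("u1", "g1", 10, 0, 1),
    ("u1", "g3", 2, 2, 1),
    ("u1", "g3", 5, 3, 1),
    ("u1", "g4", 5, 4, 1),
    ("u2", "g2", 1, 1, 1),
], ["UserID", "GameID", "Score", "Time", "Plays"])

# Assumption: games are numbered per user in the order they were first played.
rowNumberWindow = Window.partitionBy("UserID").orderBy("Time")

result = (df
    # Collapse repeated plays of the same game: best score, total plays, first time.
    .groupBy("UserID", "GameID")
    .agg(F.max("Score").alias("Score"),
         F.sum("Plays").alias("Plays"),
         F.min("Time").alias("Time"))
    # Number the games per user and compute the per-user play total in the same pass.
    .select(
        F.col("*"),
        F.row_number().over(rowNumberWindow).alias("GameNumber"),
        F.sum("Plays").over(rowNumberWindow.orderBy()).alias("total_plays"))
    # Assumption: GameCol is "MaxScoreGame" plus the per-user game number.
    .withColumn("GameCol",
                F.concat(F.lit("MaxScoreGame"),
                         F.col("GameNumber").cast("string")))
    # total_plays rides along as a secondary grouping key, so no join is needed.
    .groupBy("UserID", "total_plays")
    .pivot("GameCol", ["MaxScoreGame1", "MaxScoreGame2"])
    .agg(F.max("Score"))
    .withColumnRenamed("total_plays", "Plays")
    .select("UserID", "MaxScoreGame1", "MaxScoreGame2", "Plays"))

result.show()

On the sample data this prints the desired output shown above. The point of the trick is that total_plays is computed with a window function in the same pass, so it can simply be carried along as a secondary grouping key and no separate aggregation plus join is required.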
That concludes this article on pivoting one column in Spark while keeping the rest intact; hopefully the recommended answer above is helpful.