This article describes how to pivot one column in Spark while keeping the rest of the dataframe intact; it may serve as a useful reference for anyone facing the same problem.
Problem description
Given the following dataframe, how do I pivot the max scores but aggregate the sum of plays?
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
from pyspark.sql import Window

# In a standalone script the contexts have to be created first;
# in a PySpark shell `sc` and `sqlContext` are already available.
sc = SparkContext()
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame([
    ("u1", "g1", 10, 0, 1),
    ("u1", "g3", 2, 2, 1),
    ("u1", "g3", 5, 3, 1),
    ("u1", "g4", 5, 4, 1),
    ("u2", "g2", 1, 1, 1),
], ["UserID", "GameID", "Score", "Time", "Plays"])
Desired output
+------+-------------+-------------+-----+
|UserID|MaxScoreGame1|MaxScoreGame2|Plays|
+------+-------------+-------------+-----+
| u1| 10| 5| 4|
| u2| 1| null| 1|
+------+-------------+-------------+-----+
I posted a solution below, but I'm hoping to avoid using a join.
Recommended answer
I don't think it is a real improvement, but you can add the total number of plays
...
.select(
    F.col("*"),
    F.row_number().over(rowNumberWindow).alias("GameNumber"),
    F.sum("Plays").over(rowNumberWindow.orderBy()).alias("total_plays")
)
...
and use it later as a secondary grouping column for pivot:
...
.groupBy("UserID", "total_plays")
.pivot("GameCol", ["MaxScoreGame1", "MaxScoreGame2"])
.agg(F.max("Score"))
...
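The two fragments above reference a rowNumberWindow and a GameCol column that come from the solution posted with the original question and are not reproduced here. What follows is a minimal end-to-end sketch under two assumptions reconstructed from the desired output: the window ranks each user's games by Time, and GameCol is the string "MaxScoreGame" followed by that rank.

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
from pyspark.sql import Window

sc = SparkContext()
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame([
    ("u1", "g1", 10, 0, 1),
    ("u1", "g3", 2, 2, 1),
    ("u1", "g3", 5, 3, 1),
    ("u1", "g4", 5, 4, 1),
    ("u2", "g2", 1, 1, 1),
], ["UserID", "GameID", "Score", "Time", "Plays"])

# Assumption: games are numbered per user in the order they were first played.
rowNumberWindow = Window.partitionBy("UserID").orderBy("Time")

result = (df
    # Collapse repeated plays of the same game: best score, total plays, first time.
    .groupBy("UserID", "GameID")
    .agg(F.max("Score").alias("Score"),
         F.sum("Plays").alias("Plays"),
         F.min("Time").alias("Time"))
    # Number the games per user and compute the per-user play total in the same pass.
    .select(
        F.col("*"),
        F.row_number().over(rowNumberWindow).alias("GameNumber"),
        F.sum("Plays").over(rowNumberWindow.orderBy()).alias("total_plays"))
    # Assumption: GameCol is "MaxScoreGame" plus the per-user game number.
    .withColumn("GameCol",
                F.concat(F.lit("MaxScoreGame"),
                         F.col("GameNumber").cast("string")))
    # total_plays rides along as a secondary grouping key, so no join is needed.
    .groupBy("UserID", "total_plays")
    .pivot("GameCol", ["MaxScoreGame1", "MaxScoreGame2"])
    .agg(F.max("Score"))
    .withColumnRenamed("total_plays", "Plays")
    .select("UserID", "MaxScoreGame1", "MaxScoreGame2", "Plays"))

result.show()

On the sample data this prints the desired output shown above. The point of the trick is that total_plays is computed with a window function in the same pass, so it can simply be carried along as a secondary grouping key and no separate aggregation plus join is required.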
That concludes this article on pivoting one column in Spark while keeping the rest intact; hopefully the recommended answer above is helpful.