Is it possible to pivot on multiple columns at once in PySpark?
I have a DataFrame like this:
import pandas as pd
import pyspark.sql.functions as sf

sdf = spark.createDataFrame(
    pd.DataFrame([[1, 2, 6, 1], [1, 3, 3, 2], [1, 6, 0, 3],
                  [2, 1, 0, 1], [2, 1, 7, 2], [2, 7, 8, 3]],
                 columns=['id', 'val1', 'val2', 'month'])
)
+----+------+------+-------+
| id | val1 | val2 | month |
+----+------+------+-------+
| 1 | 2 | 6 | 1 |
| 1 | 3 | 3 | 2 |
| 1 | 6 | 0 | 3 |
| 2 | 1 | 0 | 1 |
| 2 | 1 | 7 | 2 |
| 2 | 7 | 8 | 3 |
+----+------+------+-------+
I want to pivot this DataFrame on multiple columns (val1, val2, ...) so that I end up with a DataFrame like this:
+----+-------------+-------------+-------------+-------------+-------------+-------------+
| id | val1_month1 | val1_month2 | val1_month3 | val2_month1 | val2_month2 | val2_month3 |
+----+-------------+-------------+-------------+-------------+-------------+-------------+
| 1 | 2 | 3 | 6 | 6 | 3 | 0 |
| 2 | 1 | 1 | 7 | 0 | 7 | 8 |
+----+-------------+-------------+-------------+-------------+-------------+-------------+
I found a solution that works with hardcoded columns (see below), but I am looking for one that picks up val1, val2, etc. dynamically.
sdf_pivot = (
sdf
.groupby('id')
.pivot('month')
.agg(sf.mean('val1'),sf.mean('val2'))
)
Something like this, but unfortunately it doesn't work, since sf.mean expects a single column rather than a list...
col_to_pivot = ['val1','val2']
sdf_pivot = (
sdf
.groupby('id')
.pivot('month')
.agg(sf.mean(col_to_pivot))
)
Thanks a lot!
Best answer
IIUC, you can use a list comprehension:
newdf = sdf.groupby('id').pivot('month').agg(*[sf.mean(c).alias(c) for c in col_to_pivot])
#+---+------+------+------+------+------+------+
#| id|1_val1|1_val2|2_val1|2_val2|3_val1|3_val2|
#+---+------+------+------+------+------+------+
#| 1| 2| 6| 3| 3| 6| 0|
#| 2| 1| 0| 1| 7| 7| 8|
#+---+------+------+------+------+------+------+
col_names = [
    '{}_month{}'.format(x[1], x[0]) if len(x) > 1 else x[0]
    for c in newdf.columns
    for x in [c.split('_')]
]
#['id',
# 'val1_month1',
# 'val2_month1',
# 'val1_month2',
# 'val2_month2',
# 'val1_month3',
# 'val2_month3']
newdf = newdf.toDF(*col_names)
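The renaming step is pure string manipulation and can be checked without Spark. A minimal sketch of the same list comprehension, applied to the column names the pivot produces (the sample list below is taken from the intermediate output shown above): pivoted columns come back as "<month>_<val>" (e.g. "1_val1") and are rewritten as "<val>_month<month>" (e.g. "val1_month1"), while columns without an underscore, such as "id", are left unchanged.

```python
# Column names as returned by the pivot/agg step above.
pivot_cols = ['id', '1_val1', '1_val2', '2_val1', '2_val2', '3_val1', '3_val2']

# Split each name on '_'; if there are two parts, swap them into
# "<val>_month<month>", otherwise keep the name as-is.
col_names = [
    '{}_month{}'.format(x[1], x[0]) if len(x) > 1 else x[0]
    for c in pivot_cols
    for x in [c.split('_')]
]

print(col_names)
# ['id', 'val1_month1', 'val2_month1', 'val1_month2',
#  'val2_month2', 'val1_month3', 'val2_month3']
```

The inner `for x in [c.split('_')]` is a common trick to bind the split result to a name inside a comprehension, so the split is done only once per column.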
Regarding "python - How to pivot a table with dynamic columns in Pyspark", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58307217/