Is it possible to pivot on several different columns at once in PySpark?
I have a dataframe like this:
import pandas as pd
import pyspark.sql.functions as sf  # used as `sf` throughout

sdf = spark.createDataFrame(
    pd.DataFrame([[1, 'str1', 'str4'], [1, 'str1', 'str4'], [1, 'str2', 'str4'], [1, 'str2', 'str5'],
                  [1, 'str3', 'str5'], [2, 'str2', 'str4'], [2, 'str2', 'str4'], [2, 'str3', 'str4'],
                  [2, 'str3', 'str5']], columns=['id', 'col1', 'col2'])
)
+----+------+------+
| id | col1 | col2 |
+----+------+------+
|  1 | str1 | str4 |
|  1 | str1 | str4 |
|  1 | str2 | str4 |
|  1 | str2 | str5 |
|  1 | str3 | str5 |
|  2 | str2 | str4 |
|  2 | str2 | str4 |
|  2 | str3 | str4 |
|  2 | str3 | str5 |
+----+------+------+
I want to pivot on several columns (col1, col2, ...) to get a dataframe like this:
+----+-----------+-----------+-----------+-----------+-----------+
| id | col1_str1 | col1_str2 | col1_str3 | col2_str4 | col2_str5 |
+----+-----------+-----------+-----------+-----------+-----------+
|  1 |         2 |         2 |         1 |         3 |         2 |
|  2 |         0 |         2 |         2 |         3 |         1 |
+----+-----------+-----------+-----------+-----------+-----------+
I found a working solution (see below), but I am looking for something more compact:
# one pivot per column...
sdf_pivot_col1 = (
    sdf
    .groupby('id')
    .pivot('col1')
    .agg(sf.count('id'))
)
sdf_pivot_col2 = (
    sdf
    .groupby('id')
    .pivot('col2')
    .agg(sf.count('id'))
)
# ...then left-join each pivot back onto the distinct ids
sdf_result = (
    sdf
    .select('id').distinct()
    .join(sdf_pivot_col1, on='id', how='left')
    .join(sdf_pivot_col2, on='id', how='left')
)
sdf_result.show()
+---+----+----+----+----+----+
| id|str1|str2|str3|str4|str5|
+---+----+----+----+----+----+
|  1|   2|   2|   1|   3|   2|
|  2|null|   2|   2|   3|   1|
+---+----+----+----+----+----+
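(As an aside: where a combination never occurs, the pivot produces null rather than the 0 shown in the desired table. A minimal fix, assuming sdf_result holds the joined dataframe as above, is to fill the nulls before showing:)

sdf_result.fillna(0).show()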
Is there a more compact way to build these pivots?
Many thanks!
Best Answer
Via @mrjoseph's link, I came up with the following solution.
It works and it is cleaner, but I still don't like the joins... (a join-free sketch follows the output below):
def pivot_udf(df, *cols):
    # start from the distinct ids, then left-join one pivot per column
    mydf = df.select('id').drop_duplicates()
    for c in cols:
        mydf = mydf.join(
            df
            # prefix each value with its column name, e.g. 'col1_str1'
            .withColumn('combcol', sf.concat(sf.lit('{}_'.format(c)), df[c]))
            .groupby('id').pivot('combcol').agg(sf.count(c)),
            how='left',
            on='id'
        )
    return mydf

pivot_udf(sdf, 'col1', 'col2').show()
+---+---------+---------+---------+---------+---------+
| id|col1_str1|col1_str2|col1_str3|col2_str4|col2_str5|
+---+---------+---------+---------+---------+---------+
|  1|        2|        2|        1|        3|        2|
|  2|     null|        2|        2|        3|        1|
+---+---------+---------+---------+---------+---------+
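For a join-free alternative, here is a sketch of one approach: unpivot col1 and col2 into a single long combcol column with a union, then pivot exactly once. The name sdf_long is mine, and the snippet assumes the same sdf, sf import, and active SparkSession as above:

from functools import reduce

# stack the (id, 'col1_<value>') and (id, 'col2_<value>') rows into one long frame
sdf_long = reduce(
    lambda a, b: a.union(b),
    [sdf.select('id', sf.concat(sf.lit(c + '_'), sf.col(c)).alias('combcol'))
     for c in ['col1', 'col2']]
)

# a single pivot over the combined column; fillna(0) turns the nulls into 0s
sdf_long.groupby('id').pivot('combcol').agg(sf.count('combcol')).fillna(0).show()

Because the pivot happens once over the union, no joins are needed, and the output should match the desired table above.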
On the topic of python - how to pivot on multiple columns separately in pyspark, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57145661/