This article explains how to transpose columns into rows in a PySpark DataFrame. It should be a useful reference for anyone dealing with the same problem.

Problem description

For example, I have this DataFrame:

+----+------+------+-------+
| id | val1 | val2 | val3  |
+----+------+------+-------+
|  1 |   a  |   b  |   c   |
|  2 |   d  |   e  |   f   |
|  3 |   g  |   h  |   i   |
+----+------+------+-------+

and I want the columns val1, val2 and val3 transposed into rows in this way:

+----+------+------+
| id |  key | value|
+----+------+------+
|  1 | val1 |   a  |
|  1 | val2 |   b  |
|  1 | val3 |   c  |
|  2 | val1 |   d  |
|  2 | val2 |   e  |
|  2 | val3 |   f  |
|  3 | val1 |   g  |
|  3 | val2 |   h  |
|  3 | val3 |   i  |
+----+------+------+

How can I achieve that?

Recommended answer

By exploding a Map structure it is possible to get such a transformation. So first transform the 3 columns into a Map (based on https://stackoverflow.com/a/41291156/4725074) and then explode it:

from pyspark.sql import functions as F
from itertools import chain

# Build a map column: {"val1": val1, "val2": val2, "val3": val3}
m = F.create_map(list(chain(*(
        (F.lit(c), F.col(c)) for c in df.columns if c in ["val1", "val2", "val3"]))))

# Exploding the map yields one (key, value) row per map entry
df.withColumn('map', m)\
  .select('*', F.explode('map')).drop('map')
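To make the key/value expansion concrete without needing a Spark runtime, the same melt logic can be sketched in plain Python. Here `rows` is just a list of dicts standing in for the DataFrame rows from the example; the real work above is done by Spark.

```python
# Plain-Python sketch of the map-then-explode melt (illustration only).
rows = [
    {"id": 1, "val1": "a", "val2": "b", "val3": "c"},
    {"id": 2, "val1": "d", "val2": "e", "val3": "f"},
    {"id": 3, "val1": "g", "val2": "h", "val3": "i"},
]

# For each row, emit one record per selected column: (id, key, value)
melted = [
    {"id": r["id"], "key": k, "value": r[k]}
    for r in rows
    for k in ("val1", "val2", "val3")
]

print(melted[0])  # {'id': 1, 'key': 'val1', 'value': 'a'}
```

Each input row of 3 value columns becomes 3 output rows, matching the 9-row result shown in the question.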

It's important to use select() rather than withColumn(), because exploding a Map column generates 2 new columns, key and value.

The result is the transposition of the selected columns into rows, as shown in the example.

