Problem Description
I have a Spark dataframe `t`, which is the result of a `spark.sql("...")` query. Here are the first few rows of `t`:
| yyyy_mm_dd | x_id | x_name | b_app | status | has_policy | count |
|------------|------|-------------|---------|---------------|------------|-------|
| 2020-08-18 | 1 | first_name | content | no_contact | 1 | 23 |
| 2020-08-18 | 1 | first_name | content | no_contact | 0 | 346 |
| 2020-08-18 | 2 | second_name | content | implemented | 1 | 64 |
| 2020-08-18 | 2 | second_name | content | implemented | 0 | 5775 |
| 2020-08-18 | 3 | third_name | content | implemented | 1 | 54 |
| 2020-08-18 | 3 | third_name | content | implemented | 0 | 368 |
| 2020-08-18 | 4 | fourth_name | content | first_contact | 1 | 88 |
| 2020-08-18 | 4 | fourth_name | content | first_contact | 0 | 659 |
There are two rows per `x_id`, which is due to grouping on `has_policy`. I would like to pivot `has_policy` and `count` into columns so that I have one row per `x_id` instead. This is how the output would look:
| yyyy_mm_dd | x_id | x_name | b_app | status | has_policy_count | has_no_policy_count |
|------------|------|-------------|---------|---------------|------------------|---------------------|
| 2020-08-18 | 1 | first_name | content | no_contact | 23 | 346 |
| 2020-08-18 | 2 | second_name | content | implemented | 64 | 5775 |
| 2020-08-18 | 3 | third_name | content | implemented | 54 | 368 |
| 2020-08-18 | 4 | fourth_name | content | first_contact | 88 | 659 |
I'm not sure whether it would be easier to achieve this by converting to Pandas first, or whether we can operate on the Spark df as it is to get the same result.
Data types:
```
t.dtypes

[('yyyy_mm_dd', 'date'),
 ('x_id', 'int'),
 ('x_name', 'string'),
 ('b_app', 'string'),
 ('status', 'string'),
 ('has_policy', 'bigint'),
 ('count', 'bigint')]
```
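For reference, here is a minimal sketch (with made-up sample values, not my real data) that reproduces this schema, in case you want to experiment with it:

```python
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows mirroring the table above; values are illustrative only.
t = spark.createDataFrame(
    [
        (date(2020, 8, 18), 1, "first_name", "content", "no_contact", 1, 23),
        (date(2020, 8, 18), 1, "first_name", "content", "no_contact", 0, 346),
        (date(2020, 8, 18), 2, "second_name", "content", "implemented", 1, 64),
        (date(2020, 8, 18), 2, "second_name", "content", "implemented", 0, 5775),
    ],
    "yyyy_mm_dd date, x_id int, x_name string, b_app string, status string, has_policy long, count long",
)
```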
Recommended Answer
Assuming `df` is your dataframe, `pivot` is quite straightforward to use once you read the docs.
```python
# Group on every column except has_policy and count, pivot the two
# has_policy values (0 and 1) into columns, and sum count within each cell.
df.groupBy(
    "yyyy_mm_dd", "x_id", "x_name", "b_app", "status"
).pivot("has_policy", [0, 1]).sum("count")
```