This article explains how to pivot rows into columns in Spark; it may serve as a useful reference for anyone facing a similar problem.

Problem description

I have a Spark dataframe t which is the result of a spark.sql("...") query. Here are the first few rows of t:

| yyyy_mm_dd | x_id | x_name      | b_app   | status        | has_policy | count |
|------------|------|-------------|---------|---------------|------------|-------|
| 2020-08-18 | 1    | first_name  | content | no_contact    | 1          | 23    |
| 2020-08-18 | 1    | first_name  | content | no_contact    | 0          | 346   |
| 2020-08-18 | 2    | second_name | content | implemented   | 1          | 64    |
| 2020-08-18 | 2    | second_name | content | implemented   | 0          | 5775  |
| 2020-08-18 | 3    | third_name  | content | implemented   | 1          | 54    |
| 2020-08-18 | 3    | third_name  | content | implemented   | 0          | 368   |
| 2020-08-18 | 4    | fourth_name | content | first_contact | 1          | 88    |
| 2020-08-18 | 4    | fourth_name | content | first_contact | 0          | 659   |

There are two rows per x_id, due to the grouping on has_policy. I would like to pivot has_policy and count into columns so that there is one row per x_id instead. This is how the output would look:

| yyyy_mm_dd | x_id | x_name      | b_app   | status        | has_policy_count | has_no_policy_count |
|------------|------|-------------|---------|---------------|------------------|---------------------|
| 2020-08-18 | 1    | first_name  | content | no_contact    | 23               | 346                 |
| 2020-08-18 | 2    | second_name | content | implemented   | 64               | 5775                |
| 2020-08-18 | 3    | third_name  | content | implemented   | 54               | 368                 |
| 2020-08-18 | 4    | fourth_name | content | first_contact | 88               | 659                 |

I'm not sure whether it would be easier to achieve this by converting to Pandas first, or whether we can operate on the Spark df as it is to get the same result.

Data types:

t.dtypes
[('yyyy_mm_dd', 'date'),
 ('xml_id', 'int'),
 ('xml_name', 'string'),
 ('b_app', 'string'),
 ('status', 'string'),
 ('has_policy', 'bigint'),
 ('count', 'bigint')]

Recommended answer

Assuming df is your dataframe, pivot is quite straightforward to use once you read the documentation.

df.groupBy(
    "yyyy_mm_dd", "x_id", "x_name", "b_app", "status"
).pivot("has_policy", [0, 1]).sum("count")
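
With a single aggregation, the pivoted columns come out named after the pivot values ("0" and "1"). Here is a minimal sketch, assuming the column names from the question, that renames them to match the desired output (has_no_policy_count / has_policy_count are taken from the question's expected result):

# Pivot has_policy into one column per value (0 and 1), summing count.
pivoted = df.groupBy(
    "yyyy_mm_dd", "x_id", "x_name", "b_app", "status"
).pivot("has_policy", [0, 1]).sum("count")

# The pivoted columns are named "0" and "1"; rename them to the
# names used in the desired output above.
result = (
    pivoted
    .withColumnRenamed("0", "has_no_policy_count")
    .withColumnRenamed("1", "has_policy_count")
)

result.show()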

That concludes this article on pivoting rows into columns; we hope the recommended answer is helpful.
