Problem Description
I have a Spark dataframe `t`, which is the result of a `spark.sql("...")` query. Here are the first few rows of `t`:
| yyyy_mm_dd | x_id | x_name | b_app | status | has_policy | count |
|------------|------|-------------|---------|---------------|------------|-------|
| 2020-08-18 | 1 | first_name | content | no_contact | 1 | 23 |
| 2020-08-18 | 1 | first_name | content | no_contact | 0 | 346 |
| 2020-08-18 | 2 | second_name | content | implemented | 1 | 64 |
| 2020-08-18 | 2 | second_name | content | implemented | 0 | 5775 |
| 2020-08-18 | 3 | third_name | content | implemented | 1 | 54 |
| 2020-08-18 | 3 | third_name | content | implemented | 0 | 368 |
| 2020-08-18 | 4 | fourth_name | content | first_contact | 1 | 88 |
| 2020-08-18 | 4 | fourth_name | content | first_contact | 0 | 659 |
There are two rows per `x_id`, which is due to grouping on `has_policy`. I would like to pivot `has_policy` and `count` into columns so that I have one row per `x_id` instead. This is how the output would look:
| yyyy_mm_dd | x_id | x_name | b_app | status | has_policy_count | has_no_policy_count |
|------------|------|-------------|---------|---------------|------------------|---------------------|
| 2020-08-18 | 1 | first_name | content | no_contact | 23 | 346 |
| 2020-08-18 | 2 | second_name | content | implemented | 64 | 5775 |
| 2020-08-18 | 3 | third_name | content | implemented | 54 | 368 |
| 2020-08-18 | 4 | fourth_name | content | first_contact | 88 | 659 |
I'm not sure whether it would be easier to achieve this by converting to Pandas first, or whether we can operate on the Spark df as it is to get the same result.
Data types:
```
t.dtypes

[('yyyy_mm_dd', 'date'),
 ('x_id', 'int'),
 ('x_name', 'string'),
 ('b_app', 'string'),
 ('status', 'string'),
 ('has_policy', 'bigint'),
 ('count', 'bigint')]
```
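For reference, here is a minimal sketch (with made-up sample values, not my real data) that reproduces this schema, in case you want to experiment with it:

```python
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows mirroring the table above; values are illustrative only.
t = spark.createDataFrame(
    [
        (date(2020, 8, 18), 1, "first_name", "content", "no_contact", 1, 23),
        (date(2020, 8, 18), 1, "first_name", "content", "no_contact", 0, 346),
        (date(2020, 8, 18), 2, "second_name", "content", "implemented", 1, 64),
        (date(2020, 8, 18), 2, "second_name", "content", "implemented", 0, 5775),
    ],
    "yyyy_mm_dd date, x_id int, x_name string, b_app string, status string, has_policy long, count long",
)
```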
Recommended Answer
Assuming `df` is your dataframe, `pivot` is quite straightforward to use once you read the docs.
```python
# Group on every column except has_policy and count, pivot the two
# has_policy values (0 and 1) into columns, and sum count within each cell.
df.groupBy(
    "yyyy_mm_dd", "x_id", "x_name", "b_app", "status"
).pivot("has_policy", [0, 1]).sum("count")
```