问题描述
我需要比较Hive中表的两个分区中的数据更改.具体来说,我有两个分区(ptn_dt = '01 -31-2019'和ptn_dt = '02 -28-2019').每个分区包含以下列-num_key和active_indicator(是或否).
I need to compare data changes in two partitions of table in Hive. Specifically, I have two partitions (ptn_dt='01-31-2019' and ptn_dt='02-28-2019'). Each partition contains the following columns - num_key and active_indicator (true or false).
我需要确定以下内容:
到目前为止,我已成功获取每个分区的计数,并使用下面的Hive SQL查询按active_indicator进行了分组.但是,尝试在分区之间进行比较时遇到麻烦.
So far, I was successful in getting the count per partition, grouped by active_indicator using the Hive SQL query below. However, I am having trouble when trying to compare in between partitions.
"""SELECT active_indicator,count(*) from table_name WHERE ptn_dt='2019-01-31' GROUP BY active_indicator"""
使用示例数据编辑并获得所需结果:
EDITED WITH SAMPLE DATA AND DESIRED RESULTS:
ptn_dt ='2019-01-31'
ptn_dt = '2019-01-31'
111 | true |
112 | false |
113 | false |
114 | false |
115 | true |
116 | true |
ptn_dt ='2019-02-28'
ptn_dt = '2019-02-28'
111 | true |
112 | false |
113 | true |
114 | true |
115 | true |
116 | true |
117 | true |
118 | false |
119 | true |
输出:
推荐答案
您可以使用条件聚合.这会将比较结果放在同一行:
You can use conditional aggregation. This puts the comparisons in the same row:
SELECT active_indicator,
SUM(CASE WHEN ptn_dt = '2019-01-31' THEN 1 ELSE 0 END),
SUM(CASE WHEN ptn_dt = '2019-02-28' THEN 1 ELSE 0 END)
FROM table_name
WHERE ptn_dt IN ('2019-01-31', '2019-02-28')
GROUP BY active_indicator;
或者,在不同的行中,您可以使用:
Or, in different rows, you could use:
SELECT active_indicator, ptn_dt, COUNT(*)
FROM table_name
WHERE ptn_dt IN ('2019-01-31', '2019-02-28')
GROUP BY active_indicator, ptn_dt;
根据您的评论,使用 lag()
.对于所有组合:
Based on your comment, use lag()
. For all combinations:
select prev_active_indicator, active_indicator, count(*)
from (select t.*,
lag(active_indicator) over (partition by num_key order by ptn_dt) as prev_active_indicator
from table_name t
where ptn_dt IN ('2019-01-31', '2019-02-28')
) t
where ptn_dt = '2019-02-28'
group by prev_active_indicator, active_indicator;
这篇关于比较Hive中表的两个分区的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!