最近在数据处理中用到了窗函数, 把使用方法记录一下, 暂时只有分组排序和滑动时间窗口的例子, 以后再逐步添加
场景
在SQL查询时, 会遇到有两类需要分组统计的场景, 在之前的SQL语法中是不方便实现的
- 场景1: 顾客维修设备的记录表, 每次维修产生一条记录, 每个记录包含时间, 顾客ID和维修金额, 要取出每个顾客的维修次数和最后一次维修时的金额
- 场景2: 还是上面的维修记录表, 要取出每个顾客的每次维修之间的时间间隔
- 场景3: 一个用户账户的交易流水表, 要求每个小时的交易笔数和平均收支金额, 这个平均数的统计范围是两个小时(整点时间的前后一个小时)
使用窗函数直接SQL中使用窗函数就能解决这些问题, 否则需要使用临时表, 函数或存储过程进行处理.
窗函数
PostgreSQL 从2010年的版本8开始就支持窗函数了.
文档
详细说明建议查看官方文档 https://www.postgresql.org/docs/current/tutorial-window.html
函数说明
窗函数(window function)的计算方式与传统的单行和聚合不同
- 窗函数是在当前表中, 基于当前行的相关行的计算, 注意是基于多行的计算
- 属于一种聚合计算, 可以使用聚合类型的函数(aggregate function)
- 使用窗函数并不会导致结果的聚合, 也就是结果依然是当前的行结构
所以综合的说, 窗口函数就是在行的基础上, 允许对多行数据进行计算. 下面是一个简单的窗函数例子, 将每个员工的薪资与其所在的部门的平均薪资进行比较
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
关键词
使用窗函数时会用到的一些关键词
- OVER 前面的查询基于后面的窗口
- PARTITION BY 类似于 GROUP BY 的语义, 专用于窗口的分组
- ORDER BY 窗内的排序依据, 依据的字段决定了 RANGE 的类型
- RANGE ... PRECEDING 在当前值之前的范围, 基准是当前记录这个 ORDER BY 字段的值
- RANGE ... FOLLOWING 在当前值之后的范围, 基准是当前记录这个 ORDER BY 字段的值
- RANGE BETWEEN ... PRECEDING AND ... FOLLOWING 前后范围的组合
- WINDOW 将窗口命名为变量, 可以在 SELECT 中重复使用
示例
按窗口打序号
功能: 将数据按指定的字段分组, 再按另一个字段排列, 给每个分组里的数据打上序号.
这是一个常用技巧, 例如要计算各组内记录之间的时间间隔, 但是用时间不方便join, 打完序号后就可以用序号join了
SELECT
ROW_NUMBER() OVER w1 AS rn,
sample_01.*
FROM
sample_01
WINDOW
w1 AS (PARTITION BY field_name ORDER BY created_at ASC);
简单时间窗口统计
功能: 将数据表按指定字段(日期类型)进行排序, 然后基于每个记录的这个字段创建一个固定宽度的时间窗口, 对窗口内的多个记录进行统计
统计单个字段, 可以直接写在select中
SELECT
MAX(amount) OVER (ORDER BY traded_at RANGE '30 minutes' PRECEDING) AS amount_max,
*
FROM sample_01
WHERE card_num = '6210812500006111111'
基于时间窗口变量进行多字段统计
功能: 和前一个功能一样, 但是要进行多个不同的统计, 要重复用到这个窗口函数
如果要统计多个字段, 可以抽出单独的WINDOW
SELECT
MAX(rn) OVER w1 AS rn_max,
MAX(amount) OVER w1 AS amount_max,
AVG(amount) OVER w1 AS amount_avg,
*
FROM sample_01_diff
WINDOW
-- w1 AS (ORDER BY traded_at RANGE '30 minutes' PRECEDING)
w1 AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '30 minutes' PRECEDING AND '30 minutes' FOLLOWING)
ORDER BY
rn ASC
在这个例子中
- 先依据 card_num 这个字段进行分区,
- 然后按 traded_at 这个字段进行排序,
- 对每个记录的 traded_at 值, 开启一个 RANGE, 包含前面的30分钟和后面的30分钟, RANGE 中能用的类型和 ORDER BY 的字段类型是相关的
- SELECT中的 MAX, MIN 等聚合函数, 是基于上面的 RANGE 进行的
多个窗口多个字段同时统计
功能: 在前面的功能基础上, 同时存在多个时间窗口
SELECT
-- 1 hour
SUM(amount_in) OVER w1h AS h1_amount_in_sum,
SUM(
CASE
WHEN amount_in = 0 THEN 0
ELSE 1
END
) OVER w1h AS h1_amount_in_count,
SUM(amount_out) OVER w1h AS h1_amount_out_sum,
SUM(
CASE
WHEN amount_out = 0 THEN 0
ELSE 1
END
) OVER w1h AS h1_amount_out_count,
SUM(amount) OVER w1h AS h1_amount_sum,
COUNT(amount) OVER w1h AS h1_amount_count,
ROUND(AVG(amount) OVER w1h, 2) AS h1_amount_avg,
FIRST_VALUE(amount) OVER w1h AS h1_amount_first,
LAST_VALUE(amount) OVER w1h AS h1_amount_last,
MAX(amount) OVER w1h AS h1_amount_max,
MIN(amount) OVER w1h AS h1_amount_min,
-- 3 hour
SUM(amount_in) OVER w3h AS h3_amount_in_sum,
SUM(
CASE
WHEN amount_in = 0 THEN 0
ELSE 1
END
) OVER w3h AS h3_amount_in_count,
SUM(amount_out) OVER w3h AS h3_amount_out_sum,
SUM(
CASE
WHEN amount_out = 0 THEN 0
ELSE 1
END
) OVER w3h AS h3_amount_out_count,
SUM(amount) OVER w3h AS h3_amount_sum,
COUNT(amount) OVER w3h AS h3_amount_count,
ROUND(AVG(amount) OVER w3h, 2) AS h3_amount_avg,
FIRST_VALUE(amount) OVER w3h AS h3_amount_first,
LAST_VALUE(amount) OVER w3h AS h3_amount_last,
MAX(amount) OVER w3h AS h3_amount_max,
MIN(amount) OVER w3h AS h3_amount_min,
*
FROM sample_01
WINDOW
w1h AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '30 minutes' PRECEDING AND '30 minutes' FOLLOWING),
w3h AS (PARTITION BY card_num ORDER BY traded_at RANGE BETWEEN '90 minutes' PRECEDING AND '90 minutes' FOLLOWING)
;
参考
- https://www.postgresql.org/docs/current/tutorial-window.html
- https://stackoverflow.com/questions/14989927/sql-sliding-window-finding-max-value-over-interval
- https://www.postgresqltutorial.com/postgresql-window-function/
- https://www.citusdata.com/blog/2018/06/01/fun-with-sql-window-functions-in-postgresql/
- https://tapoueh.org/blog/2013/08/understanding-window-functions/