Problem description
I'm trying to crunch some binge-viewing stats and would like to find the length of the longest binge streak (a binge being multiple programs viewed in succession, one after another, no more than 2 hours apart). The data looks like this:
datetime user_id program
2013-09-01 00:01:18 1 A
2013-09-10 14:03:14 1 B
2013-09-20 17:02:12 2 A
2013-09-21 00:03:22 2 C <-- user 2 binge start
2013-09-21 01:23:22 2 M
2013-09-21 03:03:22 2 E
2013-09-21 04:03:22 2 F
2013-09-21 06:03:22 2 G <-- user 2 binge end
2013-09-21 09:03:22 2 H
2013-09-03 18:21:09 3 D
2013-09-21 09:03:22 2 H
2013-09-24 19:21:00 2 X <-- user 2 second binge start
2013-09-24 20:21:00 2 Y
2013-09-24 21:21:00 2 Z <-- user 2 second binge end
In this example user 2 had a binge that lasted 6 hours and later another that lasted 2 hours.
The end result I would like is something like:
user_id binge length
2 1 6 hours
2 2 2 hours
Can this be calculated directly in the database?
Recommended answer
This is a problem of identifying sequences (streaks) in the data. My preferred way of doing this is:
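- use the LAG function to identify the start of each streak
- use the SUM function to assign a unique number to each streak
- then group by that unique number for further processing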
Query:
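with start_grp as (
    -- flag a row as the start of a new streak when the gap to the
    -- same user's previous row is more than 2 hours
    select dt, user_id, programme,
           case when dt - lag(dt,1) over (partition by user_id order by dt)
                     > interval '0 day 2:00:00'
                then 1
                else 0
           end grp_start
    from binge
),
assign_grp as (
    -- a running total of the start flags gives each streak its own number per user
    select dt, user_id, programme,
           sum(grp_start) over (partition by user_id order by dt) grp
    from start_grp
)
-- one row per streak; keep only streaks with more than one programme
select user_id, grp as binge, max(dt) - min(dt) as binge_length
from assign_grp
group by user_id, grp
having count(programme) > 1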
The binge column may not be numbered sequentially here, because streaks containing a single programme are dropped by the HAVING clause and leave gaps in the numbering. You can apply the ROW_NUMBER function over the final query to correct this.
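The renumbering with ROW_NUMBER could look something like the sketch below. It assumes a PostgreSQL-style dialect and the table and column names the query above already uses (binge, dt, user_id, programme). The create table statement, the binges CTE and the binge_start column are introduced here purely for illustration, and the sample rows are taken from the question with the repeated H row entered once.

```sql
-- Assumed schema, inferred from the column names used in the query above.
create table binge (
    dt        timestamp,
    user_id   integer,
    programme text
);

-- Sample rows from the question (the repeated H row is entered once).
insert into binge (dt, user_id, programme) values
    ('2013-09-01 00:01:18', 1, 'A'),
    ('2013-09-10 14:03:14', 1, 'B'),
    ('2013-09-20 17:02:12', 2, 'A'),
    ('2013-09-21 00:03:22', 2, 'C'),
    ('2013-09-21 01:23:22', 2, 'M'),
    ('2013-09-21 03:03:22', 2, 'E'),
    ('2013-09-21 04:03:22', 2, 'F'),
    ('2013-09-21 06:03:22', 2, 'G'),
    ('2013-09-21 09:03:22', 2, 'H'),
    ('2013-09-03 18:21:09', 3, 'D'),
    ('2013-09-24 19:21:00', 2, 'X'),
    ('2013-09-24 20:21:00', 2, 'Y'),
    ('2013-09-24 21:21:00', 2, 'Z');

with start_grp as (
    -- 1 when the gap to the user's previous row exceeds 2 hours, else 0
    select dt, user_id, programme,
           case when dt - lag(dt) over (partition by user_id order by dt)
                     > interval '2 hours'
                then 1
                else 0
           end as grp_start
    from binge
),
assign_grp as (
    -- running total of the flags: one number per streak, per user
    select dt, user_id, programme,
           sum(grp_start) over (partition by user_id order by dt) as grp
    from start_grp
),
binges as (
    -- hypothetical helper CTE: one row per streak of more than one programme
    select user_id,
           min(dt)           as binge_start,
           max(dt) - min(dt) as binge_length
    from assign_grp
    group by user_id, grp
    having count(programme) > 1
)
-- renumber the surviving streaks 1, 2, ... per user, in time order
select user_id,
       row_number() over (partition by user_id order by binge_start) as binge,
       binge_length
from binges
order by user_id, binge;
```

With these sample rows, only user 2 has streaks of more than one programme, so the result should contain two rows for user 2, numbered 1 and 2, with lengths of 6 hours and 2 hours. Ordering by binge_start inside ROW_NUMBER is one reasonable choice; ordering by the original grp value would give the same numbering.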
Demo at