Problem description
I have a large daily status table (570 million rows) covering more than 100,000 users. It currently lives in MySQL (or as a CSV file). The table has three columns: user_id, status, and date. Ideally, I'd like to reduce it to a new table with user_id, status, start_date, and end_date for each status period, where a period spans at least one date.
The challenge is that users can switch back and forth between statuses, so I cannot assume that a period runs from the minimum date to the maximum date for a given status.
I can work with a solution in MySQL, in Python, or in the terminal (Mac).
Current data:
user_id, status, date
1, GRE, 2018-09-02
1, GRE, 2018-09-03
1, PRO, 2018-09-04
1, PRO, 2018-09-05
1, PRO, 2018-09-06
1, GRE, 2018-09-07
1, GRE, 2018-09-08
1, GRE, 2018-09-09
1, GRE, 2018-09-10
Desired new format:
user_id, status, start_date, end_date
1, GRE, 2018-09-02, 2018-09-03
1, PRO, 2018-09-04, 2018-09-06
1, GRE, 2018-09-07, 2018-09-10
Recommended answer
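Consider the following, which needs MySQL 8.0 or later for the common table expression and window functions: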
Schema
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
( user_id INT     NOT NULL
, status  CHAR(3) NOT NULL
, date    DATE    NOT NULL
, PRIMARY KEY (user_id, date)
);
INSERT INTO my_table VALUES
(1, 'GRE', '2018-09-02'),
(1, 'GRE', '2018-09-03'),
(1, 'PRO', '2018-09-04'),
(1, 'PRO', '2018-09-05'),
(1, 'PRO', '2018-09-06'),
(1, 'GRE', '2018-09-07'),
(1, 'GRE', '2018-09-08'),
(1, 'GRE', '2018-09-09'),
(1, 'GRE', '2018-09-10'),
(2, 'GRE', '2018-09-02'),
(2, 'GRE', '2018-09-03'),
(2, 'PRO', '2018-09-04'),
(2, 'PRO', '2018-09-05'),
(2, 'PRO', '2018-09-06'),
(2, 'GRE', '2018-09-07'),
(2, 'GRE', '2018-09-08'),
(2, 'GRE', '2018-09-09'),
(2, 'GRE', '2018-09-10');
Query
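WITH t AS (
  SELECT user_id
       , status
       , date
         -- Rank within the user minus rank within (user, status).
         -- The difference stays constant across an unbroken run of one
         -- status and changes once another status has intervened.
       , DENSE_RANK() OVER (PARTITION BY user_id ORDER BY date)
       - DENSE_RANK() OVER (PARTITION BY user_id, status ORDER BY date) AS grp
    FROM my_table
)
SELECT t.user_id
     , t.status
     , MIN(t.date) AS start
     , MAX(t.date) AS finish
  FROM t
 GROUP
    BY t.user_id
     , t.status
     , t.grp
 ORDER
    BY t.user_id
     , start;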
+---------+--------+------------+------------+
| user_id | status | start      | finish     |
+---------+--------+------------+------------+
|       1 | GRE    | 2018-09-02 | 2018-09-03 |
|       1 | PRO    | 2018-09-04 | 2018-09-06 |
|       1 | GRE    | 2018-09-07 | 2018-09-10 |
|       2 | GRE    | 2018-09-02 | 2018-09-03 |
|       2 | PRO    | 2018-09-04 | 2018-09-06 |
|       2 | GRE    | 2018-09-07 | 2018-09-10 |
+---------+--------+------------+------------+
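
For the CSV or Python route mentioned in the question, the same collapse can be sketched as a single streaming pass. This is a minimal sketch rather than the method used in the query above. It assumes the input is already sorted by user_id and then date, and the function name `collapse_periods` and the inline sample are hypothetical, made up for illustration.

```python
import csv
import io
from itertools import groupby


def collapse_periods(rows):
    """Yield (user_id, status, start_date, end_date) for each unbroken run.

    Assumption: rows are (user_id, status, date) sequences, already sorted
    by user_id and then date.
    """
    # groupby only merges ADJACENT rows with the same key, so a user who
    # returns to an earlier status starts a new group.
    for (user_id, status), run in groupby(rows, key=lambda r: (r[0], r[1])):
        first = last = next(run)
        for last in run:  # walk to the final row of the run without storing it
            pass
        yield user_id, status, first[2], last[2]


# Hypothetical sample matching the data shown in the question.
sample = """user_id, status, date
1, GRE, 2018-09-02
1, GRE, 2018-09-03
1, PRO, 2018-09-04
1, PRO, 2018-09-05
1, PRO, 2018-09-06
1, GRE, 2018-09-07
1, GRE, 2018-09-08
1, GRE, 2018-09-09
1, GRE, 2018-09-10
"""

reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
next(reader)  # skip the header row
periods = list(collapse_periods(reader))

for period in periods:
    print(", ".join(period))
```

Because `itertools.groupby` keeps only one run in flight at a time, memory use should not grow with the size of the file when `reader` wraps a real file handle instead of the inline sample. Like the query, this sketch treats a run as unbroken even if a calendar day is missing inside it.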