问题描述
我有一个移动应用浏览历史记录数据集,如下所示.
I have a mobile app browsing history dataset as shown below.
- DeviceDateTime:用户在移动应用中查看页面的日期和时间.
- 用户ID:每个用户ID代表登录移动应用程序的访问者.
- PageName:移动应用中有不同的页面.所有访客首先会进入首页,然后导航至其他页面.
- PageSequence:已访问页面的顺序.例如,Seq_1主页>Seq_2我的帐户=首先登录首页,然后导航至我的帐户"页面.
2021-01-19 16:40:00.000 | UserA | 首页 | Seq_1 |
2021-01-19 16:40:00.000 | UserA | 我的帐户 | Seq_2 |
2021-01-19 16:40:07.000 | UserA | 我的活动 | Seq_3 |
2021-01-19 16:40:07.000 | UserA | 餐厅信息 | Seq_4 |
2021-01-19 16:40:18.000 | UserA | 餐厅详细信息页面 | Seq_5 |
2021-01-19 16:40:31.000 | UserA | 餐厅详细信息页面 | Seq_6 |
2021-01-19 16:40:31.000 | UserA | 餐厅预订确认 | Seq_7 |
2021-01-19 16:40:40.000 | UserA | 首页 | Seq_8 |
2021-01-19 16:40:45.000 | UserA | 写信给我们 | Seq_9 |
2021-01-19 16:40:46.000 | UserA | 首页 | Seq_10 |
2021-01-28 21:11:53.000 | UserB | 首页 | Seq_1 |
2021-01-28 21:12:01.000 | UserB | 餐厅信息 | Seq_2 |
2021-01-28 21:13:37.000 | UserB | 餐厅信息 | Seq_3 |
2021-02-16 09:43:27.000 | UserA | 首页 | Seq_1 |
2021-02-16 09:43:43.000 | UserA | 写信给我们 | Seq_2 |
2021-02-16 09:44:50.000 | UserA | 我的帐户 | Seq_3 |
2021-02-16 09:45:03.000 | UserA | 我的活动 | Seq_4 |
我需要在SQL中进行以下预处理:
I need to do the following pre-processing in SQL:
- 将数据集汇总到一个表中,如下所示.我想在表格中显示"FROM"(源页面)和"TO"(目标页面).例如,有1位访客已经经历了来自家"的旅程.页面转到我的帐户"页;2位访客已经经历了来自家"的旅程页面以写信给我们".
- 当旅程是从"和到"同一页面时,请勿计算旅程.例如,用户A Seq_5和Seq_6,"FROM餐厅详细信息页面"转到餐厅详细信息页面";不应该包含在内.
- 不应包括两个不同用户之间的旅程.例如,从餐厅列表"(用户B Seq_3)转到主页"(UserA Seq_1)不应该包含在内.
结果表:
首页 | 我的帐户 | 1 |
我的帐户 | 我的活动 | 2 |
我的活动 | 餐厅信息 | 1 |
餐厅信息 | 餐厅详细信息页面 | 1 |
餐厅详细信息页面 | 餐厅预订确认 | 1 |
餐厅预订确认 | 首页 | 1 |
首页 | 写信给我们 | 2 |
写信给我们 | 首页 | 1 |
首页 | 餐厅信息 | 1 |
首页 | 餐厅信息 | 1 |
写信给我们 | 我的帐户 | 1 |
我大约有60万用户,数据集中共有21个唯一的PageName.
I have around 600,000 users and a total of 21 unique PageName in the dataset.
我尝试了以下脚本,但是没有用.我未能在摘要表中记录所有可能的旅程.例如,FROM写给我们"从餐厅预订确认"到我的帐户"前往家"结果中都丢失了.
I have tried the following script but it didn't work. I failed to capture all the possible journeys in the summary table. For example, FROM "Write To Us" TO "My Account", FROM "Restaurant Booking Confirmation" TO "Home" are all missing in the result.
DROP TABLE IF EXISTS #App
CREATE TABLE #App (
DeviceDateTime SMALLDATETIME,
UserID VARCHAR(100),
PageName VARCHAR(100),
PageSequence VARCHAR(100))
INSERT INTO #App VALUES
('2021-01-19 16:40:00.000','UserA', 'Home', 'Seq_1'),
('2021-01-19 16:40:00.000','UserA', 'My Account', 'Seq_2'),
('2021-01-19 16:40:07.000','UserA', 'My Activity', 'Seq_3'),
('2021-01-19 16:40:07.000','UserA', 'Restaurant Listing', 'Seq_4'),
('2021-01-19 16:40:18.000','UserA', 'Restaurant Details Page', 'Seq_5'),
('2021-01-19 16:40:31.000','UserA', 'Restaurant Details Page', 'Seq_6'),
('2021-01-19 16:40:31.000','UserA', 'Restaurant Booking Confirmation', 'Seq_7'),
('2021-01-19 16:40:40.000','UserA', 'Home', 'Seq_8'),
('2021-01-19 16:40:45.000','UserA', 'Write To Use', 'Seq_9'),
('2021-01-19 16:40:46.000','UserA', 'Home', 'Seq_10'),
('2021-01-28 21:11:53.000','UserB', 'Home', 'Seq_1'),
('2021-01-28 21:12:01.000','UserB', 'Restaurant Listing', 'Seq_2'),
('2021-01-28 21:13:37.000','UserB', 'Restaurant Listing', 'Seq_3'),
('2021-02-16 09:43:27.000','UserA', 'Home', 'Seq_1'),
('2021-02-16 09:43:43.000','UserA', 'Write To Us', 'Seq_2'),
('2021-02-16 09:44:50.000','UserA', 'My Account', 'Seq_3'),
('2021-02-16 09:45:03.000','UserA', 'My Activity', 'Seq_4');
DROP TABLE IF EXISTS #SD
with seq_fixed as
(
select
UserID,
DeviceDateTime,
PageName,
cast(right(PageSequence, charindex('_', reverse(PageSequence)) - 1) as int) as pagesequencefinal
from #App
)
, with_next as
(
select
UserID,
DeviceDateTime,
PageName,
lead(PageName) over (partition by UserID, DeviceDateTime order by UserID, DeviceDateTime ASC) as next_pagename
from seq_fixed
group by UserID, DeviceDateTime, PageName
)
select PageName, next_pagename, count(*) AS No_of_User
into #SD
from with_next
where next_pagename is not null
group by PageName, next_pagename
order by PageName, next_pagename;
SELECT * FROM #SD
推荐答案
在您之前的请求中,您只想确认每位用户首次出现该页面,因此Home-> PageX-> Home-> PageY将被解释为Home-> PageX-> PageY.为此,您必须按用户和页面分组才能找到第一次出现的情况.
In your previous request you wanted to only acknowledge a page's first ocurrence per user,so Home->PageX->Home->PageY would be interpreted as Home->PageX->PageY. For this to happen you had to group by user and page to find this first occurrence.
在此新请求中不是这种情况,因此请不要汇总:
This is not the case in this new request, so don't aggregate:
with seq_fixed as
(
select
userid,
pagename,
cast(right(pagesequence, charindex('_', reverse(pagesequence)) - 1) as int) as pagesequencefinal
from app
)
, with_next as
(
select
userid,
pagename,
lead(pagename) over (partition by userid order by pagesequencefinal) as next_pagename
from seq_fixed
)
select pagename, next_pagename, count(*)
from with_next
where next_pagename is not null
group by pagename, next_pagename
order by pagename, next_pagename;
唯一的问题是:当用户昨天在PageX上结束并且今天从Home开始时,这将计为PageX-> Home.如果您想防止这种情况的发生,则需要针对这种情况进行一些检测,例如当其前任已存在至少1小时或类似时间时,请勿将条目视为页面更改.为此,您可以使用时间戳列和 LAG
.
The only problem with this: When a user ended on PageX yesterday and starts with Home today, this will count as PageX->Home. If you want to prevent this from happening you need some detection for this situation, e.g. don't consider an entry a page change when it's pedecessor is at least 1 hour old or the like. For this you can use your timestamp column and LAG
.
这篇关于SQL:使用LEAD()和PARTITION BY访问当前行之后的下一行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!