Problem description
I have a table containing a row for each visit to an endpoint. The table looks like this:
user_id STRING
endpoint_id STRING
created_at TIMESTAMP
Sample data:
user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-03 01:01:01 UTC
How can I get the row of the first visit for each user and resource?
What is the best way to construct such a query?
Expected result:
user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC
Here's what I came up with, but this query will not work for large amounts of data. I use a window function to group together repeating user/resource rows:
SELECT
user_id,
endpoint_id,
created_at
FROM (
SELECT
user_id,
endpoint_id,
created_at,
FIRST_VALUE(created_at) OVER (PARTITION BY user_id, endpoint_id ORDER BY created_at DESC) AS first_created_at
FROM
[visits]
)
WHERE
created_at = first_created_at
Recommended answer
In the query you presented in the question, you should remove DESC from ORDER BY created_at DESC; otherwise it returns the last visit, not the first.
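For reference, here is a sketch of the question's query with that change applied. It is the same legacy SQL with DESC dropped, and it selects user_id in the inner query so that the outer SELECT can reference it. This is a sketch of the fix described above, not a tested query.

```sql
-- FIRST_VALUE over ascending created_at gives the earliest visit
-- in each (user_id, endpoint_id) partition.
SELECT
  user_id,
  endpoint_id,
  created_at
FROM (
  SELECT
    user_id,
    endpoint_id,
    created_at,
    FIRST_VALUE(created_at) OVER (PARTITION BY user_id, endpoint_id ORDER BY created_at) AS first_created_at
  FROM
    [visits]
)
-- Keep only the rows whose timestamp equals the partition's earliest one.
WHERE
  created_at = first_created_at
```

Note that if two rows in a partition share the earliest created_at, both are returned.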
Another option would be to use ROW_NUMBER(), as below:
SELECT
user_id,
endpoint_id,
created_at
FROM (
SELECT
user_id,
endpoint_id,
created_at,
ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) AS first_created
FROM [visits]
)
WHERE first_created = 1
Whether this works for large amounts of data really depends. Resources Exceeded can happen if a user_id, endpoint_id partition is big enough (as ORDER BY requires all rows of a partition to be on the same node).
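One way to judge whether this applies to your data is to look at the largest partitions. Here is a minimal sketch in the same legacy SQL dialect. The alias row_count and the LIMIT of 10 are arbitrary choices for illustration, not part of the original answer.

```sql
-- Count rows per (user_id, endpoint_id) pair and list the largest pairs first.
SELECT user_id, endpoint_id, COUNT(*) AS row_count
FROM [visits]
GROUP BY user_id, endpoint_id
ORDER BY row_count DESC
LIMIT 10
```

If a few pairs hold a very large number of rows, the window-function queries are more likely to run into the limit, and the JOIN-based approach may be worth trying.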
Step 1 - Use JOIN
SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at
FROM [visits] AS tab1
INNER JOIN (
SELECT user_id, endpoint_id, MIN(created_at) AS min_time
FROM [visits]
GROUP BY user_id, endpoint_id
) AS tab2
ON tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time
Step 2 - There is still something else to take care of here: you may have duplicate entries for the same user/resource with the same earliest timestamp. In that case you still need to extract only one row for each partition. See the final query below:
SELECT user_id, endpoint_id, created_at
FROM (
SELECT user_id, endpoint_id, created_at,
ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn
FROM (
SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at
FROM [visits] AS tab1
INNER JOIN (
SELECT user_id, endpoint_id, MIN(created_at) AS min_time
FROM [visits]
GROUP BY user_id, endpoint_id
) AS tab2
ON tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time
)
)
WHERE rn = 1
If these three columns are all you need in the result, a plain GROUP BY with MIN is enough:
SELECT user_id, endpoint_id, MIN(created_at) AS created_at
FROM [visits]
GROUP BY user_id, endpoint_id