Problem description

I have a table containing a row for each visit to an endpoint. The table looks like this:

user_id STRING
endpoint_id STRING
created_at TIMESTAMP

Sample data:

user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-03 01:01:01 UTC

How can I get the first-visit row for each user and resource?

What is the best way to construct such a query?

Expected result:

user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC


Here's what I came up with, but this query will not work for a large amount of data. I use a window function to group together repeating user/resource rows:

SELECT
    user_id,
    endpoint_id,
    created_at
FROM (
    SELECT
        user_id,
        endpoint_id,
        created_at,
        FIRST_VALUE(created_at) OVER (PARTITION BY user_id, endpoint_id ORDER BY created_at DESC) AS first_created_at
    FROM
        [visits]
    )
WHERE
    created_at = first_created_at

Recommended answer

In the query you presented in the question, you should remove DESC from ORDER BY created_at DESC; otherwise it returns the last visit, not the first.
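To see the effect of the sort direction, here is a minimal sketch that runs the FIRST_VALUE query against the sample data in both directions. It uses Python's built-in sqlite3 module as a local stand-in for BigQuery (an assumption made only so the example can be run; it needs SQLite 3.25 or newer for window functions). The table name visits mirrors the question, and timestamps are stored as plain text.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id TEXT, endpoint_id TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("user-1", "endpoint-1", "2016-01-01 01:01:01"),
        ("user-2", "endpoint-1", "2016-01-01 01:01:01"),
        ("user-1", "endpoint-2", "2016-01-02 01:01:01"),
        ("user-1", "endpoint-1", "2016-01-02 01:01:01"),
        ("user-1", "endpoint-1", "2016-01-03 01:01:01"),
    ],
)

# Same shape as the query in the question; only the sort direction varies.
QUERY = """
SELECT user_id, endpoint_id, created_at
FROM (
    SELECT user_id, endpoint_id, created_at,
           FIRST_VALUE(created_at) OVER (
               PARTITION BY user_id, endpoint_id
               ORDER BY created_at {direction}
           ) AS first_created_at
    FROM visits
)
WHERE created_at = first_created_at
ORDER BY created_at, user_id
"""

first_visits = conn.execute(QUERY.format(direction="ASC")).fetchall()
last_visits = conn.execute(QUERY.format(direction="DESC")).fetchall()

print("ASC :", first_visits)
print("DESC:", last_visits)
```

With ASC the query returns the three rows from the expected result; with DESC the user-1 / endpoint-1 row is the 2016-01-03 visit instead.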

Another option would be to use ROW_NUMBER(), as below:

SELECT
  user_id,
  endpoint_id,
  created_at
FROM (
  SELECT
      user_id,
      endpoint_id,
      created_at,
      ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) AS first_created
  FROM [visits]
)
WHERE first_created = 1

Whether this works on a large amount of data really depends. A Resources Exceeded error can happen if a user_id, endpoint_id partition is big enough (as ORDER BY requires all rows of a partition to be on the same node).
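One way to get a feel for this is to count rows per (user_id, endpoint_id) pair before running the window query. The sketch below does that with sqlite3 as a local stand-in (an assumption, as is the sample data). It only shows how to find the largest partitions; what size actually triggers the error depends on BigQuery and is not something this sketch can tell you.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id TEXT, endpoint_id TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("user-1", "endpoint-1", "2016-01-01 01:01:01"),
        ("user-2", "endpoint-1", "2016-01-01 01:01:01"),
        ("user-1", "endpoint-2", "2016-01-02 01:01:01"),
        ("user-1", "endpoint-1", "2016-01-02 01:01:01"),
        ("user-1", "endpoint-1", "2016-01-03 01:01:01"),
    ],
)

# Rows per partition, largest first: the biggest partitions are the ones
# whose rows would all have to be sorted together by the window function.
partition_sizes = conn.execute(
    """
    SELECT user_id, endpoint_id, COUNT(*) AS row_count
    FROM visits
    GROUP BY user_id, endpoint_id
    ORDER BY row_count DESC, user_id, endpoint_id
    """
).fetchall()

for user_id, endpoint_id, row_count in partition_sizes:
    print(user_id, endpoint_id, row_count)
```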

Step 1 - Use JOIN

SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at
FROM [visits] AS tab1
INNER JOIN (
  SELECT user_id, endpoint_id, MIN(created_at) AS min_time
  FROM [visits]
  GROUP BY user_id, endpoint_id
) AS tab2
ON  tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time

Step 2 - There is still something else to take care of here: you may have duplicate entries for the same user / resource. In that case you still need to extract only one row for each partition. See the final query below:

SELECT user_id, endpoint_id, created_at
FROM (
  SELECT user_id, endpoint_id, created_at,
    ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn
  FROM (
    SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at
    FROM [visits]  AS tab1
    INNER JOIN (
      SELECT user_id, endpoint_id, MIN(created_at) AS min_time
      FROM [visits]
      GROUP BY user_id, endpoint_id
    ) AS tab2
    ON  tab1.user_id = tab2.user_id
    AND tab1.endpoint_id = tab2.endpoint_id
    AND tab1.created_at = tab2.min_time
  )
)
WHERE rn = 1

If these three columns are the only ones you need, a simple GROUP BY is enough:

SELECT user_id, endpoint_id, MIN(created_at) AS created_at
FROM [visits]
GROUP BY user_id, endpoint_id
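
The duplicate handling described in Step 2 can be checked with a small sketch. It again uses sqlite3 as a local stand-in for BigQuery (an assumption, needing SQLite 3.25 or newer), and the sample rows, including the duplicated first visit, are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id TEXT, endpoint_id TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("user-1", "endpoint-1", "2016-01-01 01:01:01"),
        ("user-1", "endpoint-1", "2016-01-01 01:01:01"),  # duplicate first visit
        ("user-1", "endpoint-1", "2016-01-02 01:01:01"),
        ("user-2", "endpoint-1", "2016-01-01 01:01:01"),
    ],
)

# Step 1: join each row against the minimum timestamp of its group.
JOIN_QUERY = """
SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id,
       tab1.created_at AS created_at
FROM visits AS tab1
INNER JOIN (
    SELECT user_id, endpoint_id, MIN(created_at) AS min_time
    FROM visits
    GROUP BY user_id, endpoint_id
) AS tab2
ON  tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time
"""

# Step 2: keep a single row per group, even when the first visit is duplicated.
DEDUP_QUERY = f"""
SELECT user_id, endpoint_id, created_at
FROM (
    SELECT user_id, endpoint_id, created_at,
           ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn
    FROM ({JOIN_QUERY})
)
WHERE rn = 1
ORDER BY user_id, endpoint_id
"""

# Plain GROUP BY, for comparison when only these three columns are needed.
GROUP_QUERY = """
SELECT user_id, endpoint_id, MIN(created_at) AS created_at
FROM visits
GROUP BY user_id, endpoint_id
ORDER BY user_id, endpoint_id
"""

join_rows = conn.execute(JOIN_QUERY).fetchall()
dedup_rows = conn.execute(DEDUP_QUERY).fetchall()
group_rows = conn.execute(GROUP_QUERY).fetchall()

print(len(join_rows), "rows from the JOIN alone")
print(dedup_rows)
print(group_rows)
```

On this data the JOIN alone returns the duplicated first visit twice, while the Step 2 query and the plain GROUP BY both return one row per user / endpoint pair.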
