Rolling 90-day active users in BigQuery, improving performance (DAU/MAU/WAU)

Question

I'm trying to get the number of unique users active on a specific date, rolling 90/30/7 days back. I've got this working on a limited number of rows with the query below, but for large data sets I get memory errors because the aggregated string becomes massive.

I'm looking for a more efficient way of achieving the same result.

The table looks like this:

+---+------------+-------------+
|   |     date   |     userid  |
+---+------------+-------------+
| 1 | 2013-05-14 | xxxxx       |
| 2 | 2017-03-14 | xxxxx       |
| 3 | 2018-01-24 | xxxxx       |
| 4 | 2013-03-21 | xxxxx       |
| 5 | 2014-03-19 | xxxxx       |
| 6 | 2015-09-03 | xxxxx       |
| 7 | 2014-02-06 | xxxxx       |
| 8 | 2014-10-30 | xxxxx       |
| ..| ...        | ...         |
+---+------------+-------------+

The desired result format:

+---+------------+---------------------+-----------------------+
|   |     date   | active_users_7_days | active_users_90_days  |
+---+------------+---------------------+-----------------------+
| 1 | 2013-05-14 | 1240                | 34339                 |
| 2 | 2017-03-14 | 4334                | 54343                 |
| 3 | 2018-01-24 | .....               | .....                 |
| 4 | 2013-03-21 | .....               | .....                 |
| 5 | 2014-03-19 | .....               | .....                 |
| 6 | 2015-09-03 | .....               | .....                 |
| 7 | 2014-02-06 | .....               | .....                 |
| 8 | 2014-10-30 | .....               | .....                 |
| ..| ...        | .....               | .....                 |
+---+------------+---------------------+-----------------------+

My query is as follows:

#standardSQL
    WITH
      T1 AS(
      SELECT
        date,
        STRING_AGG(DISTINCT userid) AS IDs
      FROM
        `consumer.events`
      GROUP BY
        date ),
      T2 AS(
      SELECT
        date,
        STRING_AGG(IDs) OVER(ORDER BY UNIX_DATE(date) RANGE BETWEEN 90 PRECEDING
          AND CURRENT ROW) AS IDs
      FROM
        T1 )
    SELECT
      date,
      (
      SELECT
        COUNT(DISTINCT (userid))
      FROM
        UNNEST(SPLIT(IDs)) AS userid) AS NinetyDays
    FROM
      T2

Recommended answer

Counting unique users requires a lot of resources, even more if you want results over a rolling window. For a scalable solution, look into approximate algorithms like HLL++:

https://medium.freecodecamp.org/counting-uniques-faster-in-bigquery-with-hyperloglog-5d3764493a5a
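To see why sketches suit rolling windows, here is a minimal sketch (not part of the original answer) that builds one HLL++ sketch per day and merges them. It assumes the same public Stack Overflow table used in the queries in this answer; the January 2017 range and the column aliases are arbitrary choices for illustration.

```sql
#standardSQL
-- Illustration only: one HLL++ sketch per day, merged into a monthly estimate.
WITH daily AS (
  SELECT DATE(creation_date) AS date
   , HLL_COUNT.INIT(owner_user_id) AS sketch            -- compact summary of that day's users
   , COUNT(DISTINCT owner_user_id) AS exact_daily_users
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE DATE(creation_date) BETWEEN DATE '2017-01-01' AND DATE '2017-01-31'
  GROUP BY 1
)
SELECT HLL_COUNT.MERGE(sketch) AS approx_users_in_month  -- union of the daily sketches
 , SUM(exact_daily_users) AS sum_of_daily_counts         -- counts a user once per active day
FROM daily
```

Daily distinct counts cannot be added up, because a user active on several days is counted several times. Daily sketches can be merged, and the merged sketch estimates the number of distinct users across all of those days. That is what makes them reusable across overlapping 7/30/90-day windows.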

For an exact count, this would work (but gets slower as the window gets larger):

#standardSQL
-- Each (date, user) row is fanned out to the 90 dates that precede it,
-- so date_grp counts the users seen on date_grp+1 .. date_grp+90.
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
 , COUNT(DISTINCT owner_user_id) unique_90_day_users
 , COUNT(DISTINCT IF(i<31,owner_user_id,null)) unique_30_day_users  -- i in 1..30
 , COUNT(DISTINCT IF(i<8,owner_user_id,null)) unique_7_day_users    -- i in 1..7
FROM (
  -- one row per user per day
  SELECT DATE(creation_date) date, owner_user_id
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date)=2017
  GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
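Note that the exact query above uses DATE_SUB with offsets 1..90, so each date_grp covers the 90 days that follow it. If you would rather label each window by its last day (looking back, as in the question), a possible variant is sketched below. The switch to DATE_ADD, the 0..89 offsets and the adjusted thresholds are assumptions about the desired window definition, not part of the original answer.

```sql
#standardSQL
-- Sketch of a trailing-window variant: date_grp is the LAST day of each window.
SELECT DATE_ADD(date, INTERVAL i DAY) date_grp
 , COUNT(DISTINCT owner_user_id) unique_90_day_users                -- date_grp-89 .. date_grp
 , COUNT(DISTINCT IF(i<30,owner_user_id,null)) unique_30_day_users  -- date_grp-29 .. date_grp
 , COUNT(DISTINCT IF(i<7,owner_user_id,null)) unique_7_day_users    -- date_grp-6 .. date_grp
FROM (
  -- one row per user per day
  SELECT DATE(creation_date) date, owner_user_id
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date)=2017
  GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(0, 89)) i
GROUP BY 1
ORDER BY date_grp
```

With this variant, the first 89 days of 2017 have incomplete history, and date_grp runs on into early 2018 with windows that are only partly covered by the data. Both ends would need to be filtered out before the numbers are used.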

The approximate solution produces results much faster (14s vs 366s), but the results are then approximate:

#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
 , HLL_COUNT.MERGE(sketch) unique_90_day_users
 , HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
 , HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
  SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date)=2017
  GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp

Updated query that gives correct results by removing rows whose window covers fewer than 90 days (works when no dates are missing):

#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
 , HLL_COUNT.MERGE(sketch) unique_90_day_users
 , HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
 , HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
 , COUNT(*) window_days
FROM (
  SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date)=2017
  GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
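If some dates have no rows at all, the window_days=90 check would also drop windows that are complete but contain an empty day. One possible alternative, sketched here as an assumption rather than as part of the original answer, is to filter on the date range itself. The bounds below assume the data covers exactly 2017-01-01 through 2017-12-31, as implied by the WHERE clause.

```sql
#standardSQL
-- Sketch: keep only the windows that lie fully inside the 2017 data,
-- regardless of whether every day in the window has rows.
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
 , HLL_COUNT.MERGE(sketch) unique_90_day_users
FROM (
  SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date)=2017
  GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
-- The window for date_grp spans date_grp+1 .. date_grp+90.
HAVING date_grp BETWEEN DATE_SUB(DATE '2017-01-01', INTERVAL 1 DAY)
                    AND DATE_SUB(DATE '2017-12-31', INTERVAL 90 DAY)
ORDER BY date_grp
```

The 30-day and 7-day columns from the query above can be added back unchanged; they are left out here only to keep the sketch short.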
