BigQuery重复数据删除和分区

BigQuery重复数据删除和分区

本文介绍了BigQuery重复数据删除和分区表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想在BigQuery中创建一个视图,以允许以下操作:

I want to create a view in BigQuery that would allow for the following:

  • 重复记录
  • 将_PARTITIONTIME暴露为一列,以进行过滤以允许分区修剪

我知道建议执行重复数据删除的查询是:

I'm aware that the suggested query to do a de-duplication is:

SELECT
  * EXCEPT(row_number)
  FROM (
SELECT *
    ,ROW_NUMBER()
          OVER (PARTITION BY fct.KeyColumn ORDER BY fct.CreatedDatetime DESC) row_number
FROM `my-project.my_dataset.fact_table` fct)
WHERE
  row_number = 1

关于如何查询分区表的建议方法是:

And the suggested way on how to query a partitioned table is:

SELECT
  *
FROM
  `my-project.my_dataset.fact_table`
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
  AND TIMESTAMP('2016-01-02');

如何将两者结合在一起以提供一个已过重复数据删除的BigQuery视图,并提供一列以允许在视图的where子句中使用_PARTITIONTIME.我了解了如果使用子选择等对分区修剪的限制.这就是为什么我想知道这是否完全可能的原因.

How can I marry the two to provide a BigQuery view that is deduped and provides a column to allow the use of _PARTITIONTIME in a where clause in the view. I read about the limitations with partition pruning if using subselects etc. which is why I'm wondering whether this is possible at all.

为了给您一些背景信息,此视图位于要流式传输的表的顶部,因此重复数据删除必须实时进行.

To give you a bit of context, this view is on top of a table that is being streamed to so the dedupe has to happen on the fly.

推荐答案

您需要使视图每天返回最新的行,以启用对分区时间的过滤.会是这样的:

You would need to have the view return the latest row on a per-day basis to enable filtering on the partition time. It would be something like this:

#standardSQL
SELECT day, latest_row.*
FROM (
  SELECT
    _PARTITIONTIME AS day,
    ARRAY_AGG(fct ORDER BY fct.CreatedDatetime DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM `my-project.my_dataset.fact_table` AS fct
  GROUP BY day, KeyColumn
);

在外部查询中,latest_row.*还应包含KeyColumn.

In the outer query, latest_row.* should also include KeyColumn.

这篇关于BigQuery重复数据删除和分区表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

08-29 03:15