问题描述
我想在BigQuery中创建一个视图,以允许以下操作:
I want to create a view in BigQuery that would allow for the following:
- 重复记录
- 将_PARTITIONTIME暴露为一列,以进行过滤以允许分区修剪
我知道建议执行重复数据删除的查询是:
I'm aware that the suggested query to do a de-duplication is:
SELECT
* EXCEPT(row_number)
FROM (
SELECT *
,ROW_NUMBER()
OVER (PARTITION BY fct.KeyColumn ORDER BY fct.CreatedDatetime DESC) row_number
FROM `my-project.my_dataset.fact_table` fct)
WHERE
row_number = 1
关于如何查询分区表的建议方法是:
And the suggested way on how to query a partitioned table is:
SELECT
*
FROM
`my-project.my_dataset.fact_table`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
AND TIMESTAMP('2016-01-02');
如何将两者结合在一起以提供一个已过重复数据删除的BigQuery视图,并提供一列以允许在视图的where子句中使用_PARTITIONTIME.我了解了如果使用子选择等对分区修剪的限制.这就是为什么我想知道这是否完全可能的原因.
How can I marry the two to provide a BigQuery view that is deduped and provides a column to allow the use of _PARTITIONTIME in a where clause in the view. I read about the limitations with partition pruning if using subselects etc. which is why I'm wondering whether this is possible at all.
为了给您一些背景信息,此视图位于要流式传输的表的顶部,因此重复数据删除必须实时进行.
To give you a bit of context, this view is on top of a table that is being streamed to so the dedupe has to happen on the fly.
推荐答案
您需要使视图每天返回最新的行,以启用对分区时间的过滤.会是这样的:
You would need to have the view return the latest row on a per-day basis to enable filtering on the partition time. It would be something like this:
#standardSQL
SELECT day, latest_row.*
FROM (
SELECT
_PARTITIONTIME AS day,
ARRAY_AGG(fct ORDER BY fct.CreatedDatetime DESC LIMIT 1)[OFFSET(0)] AS latest_row
FROM `my-project.my_dataset.fact_table` AS fct
GROUP BY day, KeyColumn
);
在外部查询中,latest_row.*
还应包含KeyColumn
.
In the outer query, latest_row.*
should also include KeyColumn
.
这篇关于BigQuery重复数据删除和分区表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!