Problem description
I am planning to append incremental data to a BigQuery table on a daily basis. Each time I add incremental data to the existing table, I want to eliminate duplicate records (based on a primary key column) from the data already in the table.
One approach would be to:
- Collect the set of keys from the incremental data (let's call it INCR_KEYS).
- Run a query along the lines of
SELECT all_cols FROM table WHERE pkey_col NOT IN (INCR_KEYS)
and store the results in a new table.
- Append the incremental data to the new table.
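The three steps above can be sketched in Python. This is only an illustration of the logic, not BigQuery code: it uses the standard-library `sqlite3` module as a local stand-in, and the table names `events`, `events_new`, and `increment`, the `payload` column, and the `merge_increment` helper are all hypothetical.

```python
import sqlite3

def merge_increment(conn, increment_rows):
    """Drop existing rows whose key is in the increment, then append the increment."""
    cur = conn.cursor()
    # Step 1: stage the incremental data; its keys play the role of INCR_KEYS.
    cur.execute("DROP TABLE IF EXISTS increment")
    cur.execute("CREATE TABLE increment (pkey_col INTEGER, payload TEXT)")
    cur.executemany("INSERT INTO increment VALUES (?, ?)", increment_rows)
    # Step 2: copy only the existing rows whose key is NOT in the increment.
    cur.execute("DROP TABLE IF EXISTS events_new")
    cur.execute(
        "CREATE TABLE events_new AS "
        "SELECT pkey_col, payload FROM events "
        "WHERE pkey_col NOT IN (SELECT pkey_col FROM increment)"
    )
    # Step 3: append the incremental data to the new table.
    cur.execute("INSERT INTO events_new SELECT pkey_col, payload FROM increment")
    # Remove the old table immediately so two full copies exist only briefly.
    cur.execute("DROP TABLE events")
    cur.execute("ALTER TABLE events_new RENAME TO events")
    conn.commit()

# Hypothetical sample data: key 2 is updated, key 4 is new.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (pkey_col INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "old-a"), (2, "old-b"), (3, "old-c")],
)
merge_increment(conn, [(2, "new-b"), (4, "new-d")])
rows = conn.execute("SELECT pkey_col, payload FROM events ORDER BY pkey_col").fetchall()
print(rows)
```

After the call, each key appears exactly once, and the row for key 2 carries the value from the increment rather than the old one.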
My concern with this approach is that it creates a duplicate copy of a big table and adds to my bills.
Is there a better way of achieving the same without creating a duplicate table?
I don't know of a way to do this without creating a duplicate table; your approach actually sounds like a pretty clever solution.
The incremental cost to you, however, is likely to be very small: BigQuery bills you for stored data only for the length of time that it exists. If you delete the old table as soon as the new one is ready, you pay for both tables for only a few seconds or minutes.
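To get a feel for the size of that overlap cost, here is a rough arithmetic sketch. It assumes storage is charged in proportion to the time the data exists, as described above. The table size, the overlap duration, and especially the rate are placeholder values, not actual BigQuery prices, and `overlap_storage_cost` is a hypothetical helper.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # simplification: treat a month as 30 days

def overlap_storage_cost(table_gb, overlap_seconds, price_per_gb_month):
    """Prorated storage cost of keeping a second copy for overlap_seconds."""
    return table_gb * price_per_gb_month * overlap_seconds / SECONDS_PER_MONTH

# Placeholder inputs: a 1,000 GB table, 10 minutes of overlap, and an
# assumed rate of 0.02 currency units per GB per month.
cost = overlap_storage_cost(1000, 10 * 60, 0.02)
print(f"{cost:.4f}")
```

With these assumed numbers the extra storage charge is a small fraction of one currency unit. This covers storage only; the query that rewrites the table may be charged separately, depending on the pricing model in use.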