Eliminating duplicate records in a BigQuery table


Problem description

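I am planning to append incremental data to a BigQuery table on a daily basis. Each time I add incremental data to the existing table, I want to eliminate duplicate records (based on a primary key column) from the data already in the table.
One approach would be to:

1. Collect the set of keys from the incremental data (let's call it INCR_KEYS).
2. Run a query along the lines of SELECT all_cols FROM table WHERE pkey_col NOT IN (INCR_KEYS) and store the results in a new table.
3. Append the incremental data to the new table.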
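
As a rough illustration of the three steps above, here is a minimal sketch that uses Python's built-in sqlite3 module as a local stand-in for BigQuery (an assumption made only so the example runs anywhere). The table names events, events_incr, and events_new and the helper dedupe_and_append are hypothetical, and INCR_KEYS is expressed as a subquery rather than a literal list of keys.

```python
import sqlite3


def dedupe_and_append(conn, existing, incremental, new_table, key):
    """Hypothetical helper: build new_table from the rows of `existing`
    whose key is absent from `incremental`, then append `incremental`."""
    cur = conn.cursor()
    # Steps 1 and 2: keep only existing rows whose key is NOT in the
    # incremental batch, and store the result in a new table.
    cur.execute(
        f"CREATE TABLE {new_table} AS "
        f"SELECT * FROM {existing} "
        f"WHERE {key} NOT IN (SELECT {key} FROM {incremental})"
    )
    # Step 3: append the incremental rows to the new table.
    cur.execute(f"INSERT INTO {new_table} SELECT * FROM {incremental}")
    conn.commit()


# In-memory SQLite database standing in for the real dataset.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (pkey_col INTEGER, payload TEXT);
    CREATE TABLE events_incr (pkey_col INTEGER, payload TEXT);
    INSERT INTO events VALUES (1, 'old-1'), (2, 'old-2'), (3, 'old-3');
    INSERT INTO events_incr VALUES (2, 'new-2'), (4, 'new-4');
    """
)

dedupe_and_append(conn, "events", "events_incr", "events_new", "pkey_col")
rows = conn.execute(
    "SELECT pkey_col, payload FROM events_new ORDER BY pkey_col"
).fetchall()
print(rows)  # key 2 now carries the incremental payload
```

Note that NOT IN matches nothing when the key subquery returns a NULL, so this sketch assumes the key column never contains NULL.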

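My concern with this approach is that it creates a duplicate copy of a big table and adds to my bill.

Is there a better way of achieving the same result without creating a duplicate table?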

Solution

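I don't know of a way to do this without creating a duplicate table. This actually sounds like a pretty clever solution.

The incremental cost to you, however, is likely to be very small: BigQuery only bills you for data for the length of time that it exists. If you delete the old table, you would only pay for both tables for a period of seconds or minutes.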
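
To make the "seconds or minutes" point concrete, here is a back-of-the-envelope sketch of prorated storage cost. It assumes storage is billed linearly over time, as described above. The function overlap_storage_cost is hypothetical, and the price and table size are placeholder inputs, not BigQuery's actual rates. It covers storage only, not any charge for running the query itself.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # simplifying assumption: a 30-day month


def overlap_storage_cost(table_gb, price_per_gb_month, overlap_seconds):
    """Extra storage cost of keeping a second copy of a table for
    `overlap_seconds`, assuming linear proration over a month."""
    return table_gb * price_per_gb_month * overlap_seconds / SECONDS_PER_MONTH


# Placeholder inputs: a 1000 GB table, a made-up price of 0.02 per GB-month,
# and a 5-minute window before the old table is deleted.
cost = overlap_storage_cost(table_gb=1000, price_per_gb_month=0.02, overlap_seconds=300)
print(f"{cost:.6f}")
```

Under these placeholder numbers the extra copy costs a small fraction of a cent, because the full monthly charge is scaled by the few minutes both tables coexist.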

 