Question
I just came across an interesting issue with BigQuery.
Essentially there is a batch job that recreates a table in BigQuery - to delete the data - and then immediately starts to feed in a new set of rows through the streaming interface.
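For illustration, here is a minimal sketch of that pattern using the Python google-cloud-bigquery client. The project, dataset, table and schema names are hypothetical, and the real job may well use a different client or language:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical name
schema = [
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField("payload", "STRING"),
]

# Recreate the table: drop it, then create a fresh empty one.
client.delete_table(table_id, not_found_ok=True)
client.create_table(bigquery.Table(table_id, schema=schema))  # returns successfully

# Immediately start feeding the new data set through the streaming interface.
rows = [{"id": i, "payload": "row %d" % i} for i in range(4000)]
errors = client.insert_rows_json(table_id, rows)
assert not errors, errors
# Observed behaviour: only ~2100-3500 of the 4000 streamed rows end up in the table.
```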
It used to work like this for quite a while - successfully.
Lately it has started to drop data.
A small test case has confirmed the situation – if the data feed starts immediately after recreating (successfully!) the table, parts of the dataset will be lost. I.e. out of 4000 records being fed in, only 2100 - 3500 would make it through.
I suspect that table creation might be returning success before the table operations (deletion and creation) have been propagated throughout the environment, so the first parts of the dataset are being fed to old replicas of the table (speculating here).
To confirm this I have put a timeout between the table creation and the start of the data feed. Indeed, if the timeout is less than 120 seconds – parts of the dataset are lost. If it is more than 120 seconds, it seems to work OK.
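The delay experiment amounts to something like the following sketch (placeholder names again; the 120-second figure is simply the threshold observed above, not a documented guarantee):

```python
import time
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical name

client.delete_table(table_id, not_found_ok=True)
client.create_table(bigquery.Table(table_id, schema=[bigquery.SchemaField("id", "INTEGER")]))

# Observed: with a pause under ~120 s, part of the data set is lost;
# with ~120 s or more, all 4000 rows arrive.
time.sleep(120)

errors = client.insert_rows_json(table_id, [{"id": i} for i in range(4000)])
assert not errors, errors
```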
There used to be no requirement for this timeout. We are using US BigQuery. Am I missing something obvious here?
From the comment provided by Sean Chen below and a few other sources - the behaviour is expected due to the way the tables are cached and the internal table id is propagated throughout the system. BigQuery has been built for append-only types of operations. Re-writes are not something that can easily be accommodated into the design and should be avoided.
Accepted answer
This is more or less expected due to the way that BigQuery streaming servers cache the table generation id (an internal name for the table).
Can you provide more information about the use case? It seems strange to delete the table and then write to the same table again.
One workaround could be to truncate the table instead of deleting it. You can do this by running

SELECT * FROM <table> LIMIT 0

with the table as the destination table (you might want to use allow_large_results = true and disable flattening, which will help if you have nested data) and write_disposition=WRITE_TRUNCATE. This will empty out the table but preserve the schema. Then any rows streamed afterwards will get applied to the same table.
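A sketch of that workaround with the Python client, under the assumption that the job runs as legacy SQL (which is where allow_large_results and flatten_results apply); the table names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical name

job_config = bigquery.QueryJobConfig(
    destination=table_id,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    use_legacy_sql=True,      # allow_large_results / flatten_results are legacy-SQL options
    allow_large_results=True,
    flatten_results=False,    # "disable flattening", helpful with nested data
)

# SELECT * ... LIMIT 0 yields zero rows with the original schema; writing that
# result back with WRITE_TRUNCATE empties the table while keeping the table
# (and its schema) in place, rather than deleting and recreating it.
query = "SELECT * FROM [my-project:my_dataset.my_table] LIMIT 0"
client.query(query, job_config=job_config).result()

# Rows streamed afterwards go to the same, now-empty table
# (the row shape here assumes a hypothetical schema).
client.insert_rows_json(table_id, [{"id": 1}])
```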