Dataflow job uses the same BigQuery job ID when deploying with a staged template multiple times?

Problem description

I am attempting to deploy a Dataflow job that reads from BigQuery and writes to Cassandra on a fixed schedule. The template code is written in Java using Apache Beam and the Dataflow library. I have staged the template on Google Cloud Storage, and have configured a Cloud Scheduler instance as well as a Cloud Function used to trigger the Dataflow template. I am using the latest versions of all Beam and BigQuery dependencies.
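
A minimal sketch of a pipeline matching this description is below. The query, Cassandra host, keyspace, and the MyRow entity class are hypothetical placeholders rather than the actual template code, and the BigQuery read deliberately sets no template-specific options:

```java
import java.io.Serializable;
import java.util.Collections;

import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.PartitionKey;
import com.datastax.driver.mapping.annotations.Table;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.cassandra.CassandraIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class BigQueryToCassandraTemplate {

  /** Hypothetical Cassandra entity for the DataStax object mapper used by CassandraIO. */
  @Table(keyspace = "my_keyspace", name = "my_table")
  public static class MyRow implements Serializable {
    private @PartitionKey @Column(name = "id") String id;
    private @Column(name = "payload") String payload;

    public MyRow() {}

    public MyRow(String id, String payload) {
      this.id = id;
      this.payload = payload;
    }

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getPayload() { return payload; }
    public void setPayload(String payload) { this.payload = payload; }
  }

  public static void main(String[] args) {
    // Staged as a classic template by running with
    // --runner=DataflowRunner --templateLocation=gs://<bucket>/<path>.
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    pipeline
        // Reading query results launches a BigQuery query job plus an extract job;
        // the extract job is the one whose ID collides when the template is re-run.
        .apply("ReadFromBigQuery",
            BigQueryIO.readTableRows()
                .fromQuery("SELECT id, payload FROM `my_project.my_dataset.my_table`")
                .usingStandardSql())
        // Map each TableRow onto the Cassandra entity.
        .apply("ToCassandraEntity",
            MapElements.into(TypeDescriptor.of(MyRow.class))
                .via((TableRow row) ->
                    new MyRow((String) row.get("id"), (String) row.get("payload"))))
        // Write to Cassandra; hosts and keyspace are placeholders.
        .apply("WriteToCassandra",
            CassandraIO.<MyRow>write()
                .withHosts(Collections.singletonList("cassandra.internal.example"))
                .withPort(9042)
                .withKeyspace("my_keyspace")
                .withEntity(MyRow.class));

    pipeline.run();
  }
}
```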

However, I have discovered that when deploying a job using the same staged template, the BigQuery extract job always seems to use the same job ID, which causes a 409 error to appear in the logs. The BigQuery query job itself seems to succeed, because the query job ID has a unique suffix appended, while the extract job ID uses the same prefix with no suffix.

I have considered two alternative solutions: either using a crontab on a Compute Engine instance to deploy the template directly, or adapting a Cloud Function to perform the same tasks as the Dataflow pipeline on a schedule. Ideally, if there is a way to change the extract job ID in the Dataflow job, that would be a much simpler fix, but I'm not sure whether it is possible. If it isn't, is there a better alternative?

Recommended answer

Based on the additional description, it sounds like this may be a case of not using withTemplateCompatibility() as directed?

When using read() or readTableRows() in a template, it's required to specify BigQueryIO.Read.withTemplateCompatibility(). Specifying this in a non-template pipeline is not recommended because it has somewhat lower performance.
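
Applied to a read like the one sketched in the question, the change is a single builder call (the query string is still a placeholder; PCollection and TableRow come from org.apache.beam.sdk.values and com.google.api.services.bigquery.model):

```java
// The read step with template compatibility enabled, so that repeated launches of
// the staged template each run their own BigQuery jobs instead of reusing the
// extract job ID baked in when the template was created.
PCollection<TableRow> rows =
    pipeline.apply("ReadFromBigQuery",
        BigQueryIO.readTableRows()
            .withTemplateCompatibility()
            .fromQuery("SELECT id, payload FROM `my_project.my_dataset.my_table`")
            .usingStandardSql());
```

The staged template would then need to be regenerated and re-uploaded for the change to take effect on subsequent scheduled launches.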
