Problem description
We are trying to write to BigQuery using Apache Beam and Avro.
The following seems to work fine:
p.apply("Input", AvroIO.read(DataStructure.class).from("AvroSampleFile.avro"))
.apply("Transform", ParDo.of(new CustomTransformFunction()))
.apply("Load", BigQueryIO.writeTableRows().to(table).withSchema(schema));
We then tried to use it in the following manner to read data from Google Pub/Sub:
p.begin()
.apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
.apply("Transform", ParDo.of(new CustomTransformFunction()))
.apply("Write", BigQueryIO.writeTableRows()
.to(table)
.withSchema(schema)
.withTimePartitioning(timePartitioning)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run().waitUntilFinish();
When we do this, the records always go to the streaming buffer, and BigQuery seems to take a long time to read from the buffer. Can anyone tell me why the above won't write the records directly to the BigQuery tables?
UPDATE: It looks like I need to add the following settings, but this throws a java.lang.IllegalArgumentException.
.withMethod(Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
Recommended answer
The answer is that you need to include withNumFileShards, like so (the value can be from 1 to 1000).
p.begin()
.apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
.apply("Transform", ParDo.of(new CustomTransformFunction()))
.apply("Write", BigQueryIO.writeTableRows()
.to(table)
.withSchema(schema)
.withTimePartitioning(timePartitioning)
.withMethod(Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
.withNumFileShards(1000)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
p.run().waitUntilFinish();
I can't find anywhere in the documentation that says withNumFileShards is mandatory; however, there is a Jira ticket for this, which I found after applying the fix.
https://issues.apache.org/jira/browse/BEAM-3198
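For context on why the Pub/Sub pipeline ended up in the buffer: according to the BigQueryIO documentation, when no method is set, a bounded input (such as the Avro file) is written with load jobs, while an unbounded input (such as a Pub/Sub topic) is written with streaming inserts. The sketch below is a plain-Java model of that rule plus the requirement described in this answer. The class WriteMethodSketch, its enum, and both methods are hypothetical names made up for illustration and are not part of Beam, and the validation only approximates the behaviour reported here rather than reproducing Beam's actual checks.

```java
// Illustrative model only: nothing in this file is a Beam class or a Beam API.
public class WriteMethodSketch {

    // Hypothetical stand-in for the write methods discussed in this thread.
    public enum Method { DEFAULT, FILE_LOADS, STREAMING_INSERTS }

    // Documented DEFAULT rule: bounded input uses load jobs,
    // unbounded input uses streaming inserts. An explicit choice wins.
    public static Method resolve(Method requested, boolean unbounded) {
        if (requested != Method.DEFAULT) {
            return requested;
        }
        return unbounded ? Method.STREAMING_INSERTS : Method.FILE_LOADS;
    }

    // Assumption based on this thread: FILE_LOADS on an unbounded input needs
    // both a triggering frequency and a positive number of file shards.
    public static void validate(Method requested, boolean unbounded,
                                Long triggeringSeconds, int numFileShards) {
        boolean streamingFileLoads =
                resolve(requested, unbounded) == Method.FILE_LOADS && unbounded;
        if (streamingFileLoads && (triggeringSeconds == null || numFileShards <= 0)) {
            throw new IllegalArgumentException(
                    "FILE_LOADS on unbounded input needs a triggering frequency and file shards");
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(Method.DEFAULT, false)); // prints FILE_LOADS
        System.out.println(resolve(Method.DEFAULT, true));  // prints STREAMING_INSERTS

        // Settings from the UPDATE in the question: frequency set, shards missing.
        try {
            validate(Method.FILE_LOADS, true, 120L, 0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // Settings from the answer: frequency and shards both set, so no exception.
        validate(Method.FILE_LOADS, true, 120L, 1000);
        System.out.println("accepted");
    }
}
```

Under this model, the file-based pipeline and the Pub/Sub pipeline differ only in boundedness, which is enough to switch the default from load jobs to streaming inserts.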