Problem Description
I am running a Dataflow job that reads from BigQuery, scans around 8 GB of data, and results in more than 50,000,000 records.
At the group-by step I want to group on a key and concatenate one column. After concatenation, the size of the concatenated column exceeds 100 MB, which is why I have to do that group by in the Dataflow job: it cannot be done at the BigQuery level due to BigQuery's 100 MB row size limit.
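For context, a rough sketch of the pipeline shape described above, assuming a recent Beam Python SDK where beam.io.ReadFromBigQuery is available; the query, table, and column names are placeholders, not the actual job's:

```python
import apache_beam as beam


def to_key_value(row):
    # Keep only the grouping key and the column that has to be concatenated.
    return (row['group_key'], row['text_column'])


with beam.Pipeline() as p:
    (
        p
        | 'ReadFromBQ' >> beam.io.ReadFromBigQuery(
            query='SELECT group_key, text_column FROM `project.dataset.table`',
            use_standard_sql=True)
        | 'ToKeyValue' >> beam.Map(to_key_value)
        | 'GroupByKey' >> beam.GroupByKey()
        | 'Concat' >> beam.Map(lambda kv: (kv[0], ','.join(kv[1])))
    )
```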
The Dataflow job scales well while reading from BigQuery, but it gets stuck at the group-by step. I have 2 versions of the Dataflow code, and both get stuck at the group by. When I checked the Stackdriver logs, they show messages like "processing stuck at lull for more than 1010 sec" (or similar) and "Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7f618b406358>".
I expect the group-by step to complete within 20 minutes, but it stays stuck for more than 1 hour and never finishes.
Recommended Answer
I figured this out myself. Below are the 2 changes I made in my pipeline:
1. I added a Combine function just after the GroupByKey, see screenshot (a minimal sketch follows this list).
2. Because the group by exchanges a lot of network traffic when running on multiple workers, and the network we were using does not allow inter-worker communication by default, I had to create a firewall rule that allows traffic from one worker to another, i.e. a rule covering the workers' IP range for internal traffic.
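A minimal sketch of what change 1 could look like, assuming the grouped values are strings and that a simple delimiter join is the intended concatenation; the function name, sample data, and ',' delimiter are illustrative, not taken from the original job:

```python
import apache_beam as beam


def concat_values(values):
    # Collapse the grouped strings for one key into a single concatenated column.
    return ','.join(values)


with beam.Pipeline() as p:
    (
        p
        | 'CreateSample' >> beam.Create([('k1', 'a'), ('k1', 'b'), ('k2', 'c')])
        | 'GroupByKey' >> beam.GroupByKey()
        | 'ConcatValues' >> beam.CombineValues(concat_values)
        | 'Print' >> beam.Map(print)
    )
```

If the GroupByKey itself is the bottleneck, beam.CombinePerKey(concat_values) can replace the GroupByKey + CombineValues pair; it lets the runner combine values partially on each worker before the shuffle, which usually reduces the shuffle volume.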