Problem Description
In Spark 3 the backpressure options for the Kafka source were changed, and the behavior of the file source under the Trigger.Once scenario changed as well.
But I have a question: how can I configure backpressure for my job when I want to use Trigger.Once?
In Spark 2.4 I have a use case: backfill some data and then start the stream. So I use trigger once, but my backfill scenario can be very, very big, and it sometimes creates too big a load on my disks (because of shuffles) and on driver memory (because the FileIndex is cached there). So I use maxOffsetsPerTrigger and maxFilesPerTrigger to control how much data my Spark job can process. That's how I configure backpressure.
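For reference, a minimal Scala sketch of what this Spark 2.4-style configuration might look like; the broker address, topic name, schema, and all paths below are illustrative placeholders, not values from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("backfill").getOrCreate()

// Placeholder schema for the file source.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StringType)))

// Kafka source: maxOffsetsPerTrigger caps how many offsets one micro-batch reads.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder brokers
  .option("subscribe", "events")                    // placeholder topic
  .option("maxOffsetsPerTrigger", "100000")
  .load()

// File source: maxFilesPerTrigger caps how many files one micro-batch reads.
val fileDf = spark.readStream
  .format("parquet")
  .schema(schema)
  .option("maxFilesPerTrigger", "100")
  .load("/data/backfill")                           // placeholder input path

// In Spark 2.4 these limits were honored under Trigger.Once, so a single run
// processed only a bounded slice of the backlog. Only the file stream is
// started here, for brevity.
fileDf.writeStream
  .format("parquet")
  .option("checkpointLocation", "/chk/backfill")    // placeholder checkpoint
  .trigger(Trigger.Once())
  .start("/data/out")                               // placeholder output path
  .awaitTermination()
```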
And now this ability has been removed, so can someone suggest a new way to go?
Recommended Answer

Trigger.Once now ignores these options (in Spark 3), so it will always read everything on the first load.
You can work around that. For example, you can start the stream with the trigger set to periodic, with some value like 1 hour, and not execute .awaitTermination(), but have a parallel loop that checks whether the first batch is done and then stops the stream. Or you can set it to continuous mode and check whether batches have 0 rows, and then terminate the stream. After that initial load you can switch the stream back to Trigger.Once.
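A minimal sketch of that workaround in Scala, assuming a Parquet file source; the schema, paths, trigger interval, and polling interval are all illustrative assumptions, not values from the answer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("backfill-workaround").getOrCreate()

// Placeholder schema and paths.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StringType)))

val df = spark.readStream
  .format("parquet")
  .schema(schema)
  .option("maxFilesPerTrigger", "100")           // still honored by periodic triggers
  .load("/data/backfill")

// Periodic trigger instead of Trigger.Once, and no awaitTermination().
val query = df.writeStream
  .format("parquet")
  .option("checkpointLocation", "/chk/backfill")
  .trigger(Trigger.ProcessingTime("1 hour"))
  .start("/data/out")

// Variant 1: stop as soon as the first batch completes, emulating the old
// Spark 2.4 Trigger.Once behavior of one bounded batch per run.
// query.lastProgress stays null until the first batch has finished.
while (query.lastProgress == null) {
  Thread.sleep(10000) // poll every 10 seconds; the interval is arbitrary
}
query.stop()

// Variant 2 (alternative): keep running until a batch reads no new rows,
// i.e. the backlog is drained, then stop:
//   while (query.lastProgress == null || query.lastProgress.numInputRows > 0) {
//     Thread.sleep(10000)
//   }
//   query.stop()

// Once this initial load is done, later runs can switch back to Trigger.Once.
```

The same polling pattern should work for a Kafka source; only the readStream options change.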