This article explains how to configure backpressure for the Kafka and File sources in Spark 3 Structured Streaming when using the Trigger.Once option.

Problem Description

In Spark 3, the behavior of the backpressure options on the Kafka source, and on the File source for the Trigger.Once scenario, was changed.

But I have a question: how can I configure backpressure for my job when I want to use Trigger.Once?

In Spark 2.4 I had a use case: backfill some data and then start the stream. So I used Trigger.Once, but my backfill scenario can be very large, and it sometimes creates too much load on my disks because of shuffles, and on driver memory because the FileIndex is cached there. So I used maxOffsetsPerTrigger and maxFilesPerTrigger to control how much data my Spark job processes per batch. That is how I configured backpressure.

Now that this ability has been removed, can someone suggest a new way to go?
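For reference, a minimal sketch of how these rate-limit options were used in Spark 2.4 (the option names are real Spark options; the broker address, topic, schema, and paths are hypothetical placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("backfill").getOrCreate()

# Kafka source: cap how many offsets a single micro-batch may consume.
kafka_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .option("maxOffsetsPerTrigger", 100000)            # rate limit per batch
    .load())

# File source: cap how many new files a single micro-batch may pick up.
schema = StructType([StructField("value", StringType())])  # hypothetical schema
file_df = (spark.readStream
    .format("parquet")
    .schema(schema)                                    # schema must be provided
    .option("maxFilesPerTrigger", 100)                 # rate limit per batch
    .load("/data/backfill"))                           # hypothetical path
```

In Spark 2.4 these limits applied even with Trigger.Once; as the answer below explains, Spark 3 ignores them for Trigger.Once.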

Solution

Trigger.Once now ignores these options (in Spark 3), so it will always read everything on the first load.

You can work around this. For example, you can start the stream with the trigger set to a periodic interval with some value like 1 hour, and not call .awaitTermination, but instead run a parallel loop that checks whether the first batch is done and then stops the stream. Alternatively, you can keep it in periodic micro-batch mode and terminate the stream once a batch has 0 rows. After that initial load, you can switch the stream back to Trigger.Once.

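The workaround above can be sketched roughly as follows (a sketch assuming PySpark; `df` is an already-defined streaming DataFrame, the paths are hypothetical, and the stop condition uses the "batch with 0 input rows" variant):

```python
import time

# Start the backfill with a periodic trigger instead of Trigger.Once,
# so that maxFilesPerTrigger / maxOffsetsPerTrigger are honored again.
query = (df.writeStream
    .format("parquet")
    .option("checkpointLocation", "/chk/backfill")   # hypothetical path
    .option("path", "/out/backfill")                 # hypothetical path
    .trigger(processingTime="1 hour")
    .start())

# Instead of query.awaitTermination(), poll in a parallel loop and
# stop the stream once the backlog has been drained.
while True:
    progress = query.lastProgress
    # A batch that reports 0 input rows means the initial load is done.
    if progress is not None and progress["numInputRows"] == 0:
        query.stop()
        break
    time.sleep(60)

# After this initial load, restart the stream with Trigger.Once,
# e.g. .trigger(once=True) in PySpark.
```

This is not the only possible stop condition; checking that the first batch has completed (as the answer also suggests) works just as well when the rate limits are sized so one batch covers the backlog.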

