Problem description
Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?
I am asking because the first batch I get contains hundreds of millions of records, and it takes ages to process and checkpoint them.
Recommended answer
I think your problem can be solved by Spark Streaming Backpressure.
Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.
By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is disabled, so I suppose Spark will take as much as it can.
spark.streaming.backpressure.enabled:
This enables Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times, so that the system receives only as fast as it can process. Internally, this dynamically sets the maximum receiving rate of the receivers. This rate is upper bounded by the values of spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).
And since you want to control the first batch, or more specifically the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate.
spark.streaming.backpressure.initialRate:
This is the initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled.
This one is good when your Spark job (or rather the Spark workers as a whole) is able to process, say, 10,000 messages from Kafka, but the Kafka brokers hand your job 100,000 messages.
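As a minimal sketch (in Scala, assuming Spark Streaming is on the classpath; the application name, batch interval, and rate value are illustrative only), the two properties above can be set on the SparkConf before the StreamingContext is created:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative values only; tune the rate to what your workers can actually process.
val conf = new SparkConf()
  .setAppName("MyStreamingApp") // hypothetical application name
  // Let Spark adapt the ingestion rate to the observed scheduling delay and processing time.
  .set("spark.streaming.backpressure.enabled", "true")
  // Initial maximum receiving rate (records per second) used for the first batch,
  // before the backpressure estimator has any feedback to work with.
  .set("spark.streaming.backpressure.initialRate", "10000")

val ssc = new StreamingContext(conf, Seconds(10))
// ... define the Kafka input stream and the processing here, then call ssc.start()
```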
You may also be interested in spark.streaming.kafka.maxRatePerPartition, as well as the research and suggestions on these properties, with a real example, by Jeroen van Wilgenburg on his blog.
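If you use the Kafka direct stream, spark.streaming.kafka.maxRatePerPartition acts as a hard per-partition cap in records per second, so it bounds every batch, including the first one. A hedged sketch with made-up numbers: with 10 partitions, a 10-second batch interval, and a cap of 1,000 records per second per partition, no batch can exceed 10 * 10 * 1000 = 100,000 records.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("MyStreamingApp") // hypothetical application name
  .set("spark.streaming.backpressure.enabled", "true")
  // Hard upper bound per Kafka partition, in records per second (direct stream only).
  // With 10 partitions and a 10-second batch interval, a batch tops out at 100,000 records.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(10))
```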