问题描述
我们正在运行一个kafka流应用程序,但遇到了一个奇怪的问题.我们同时使用全局状态存储和其他多个状态存储.
We are running a kafka streams application and stuck with a strange problem. We are using both global state store and multiple other state stores.
我们的应用程序已加载所有数据,并且状态存储中现在包含大量信息.现在,当我们尝试关闭应用程序并将其重新带回(某些配置更改)时,它将进入无休止的重新平衡..为了验证我们是否还原了配置更改,但仍停留在该阶段.没有错误等等
Our application has loaded all the data and state stores has good amount of information in it now. Now, when we tried to bring down the application and bring it back again (some config changes), it is going into endless rebalancing .. To verify we reverted back config changes, but it it still stuck in that stage. There are no erros, etc
INFO o.apache.kafka.streams.KafkaStreams - stream-client [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb] Started Streams client
INFO o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] State transition from RUNNING to PARTITIONS_REVOKED
INFO o.apache.kafka.streams.KafkaStreams - stream-client [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb] State transition from RUNNING to REBALANCING
INFO o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] partition revocation took 1 ms.
suspended active tasks: []
suspended standby tasks: []
INFO o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] State transition from RUNNING to PARTITIONS_REVOKED
INFO o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] partition revocation took 0 ms.
suspended active tasks: []
suspended standby tasks: []
04:02:13.682 6985 [main] INFO com..... - Started Application in 6.647 seconds (JVM running for 7.484)
04:02:23.300 16603 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] INFO o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED
04:02:23.300 16603 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] INFO o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED
04:02:23.328 16631 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] INFO o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] partition assignment took 28 ms.
current active tasks: [0_0, 1_0, 2_0, 3_0, 4_0, 5_0, 6_0, 7_5, 8_5, 9_5, 10_5, 12_4, 13_4, 14_4, 15_4, 16_4, 17_4, 19_3, 20_3, 21_3, 22_3, 23_3, 24_3, 25_3, 29_0]
current standby tasks: [0_2]
previous active tasks: []
04:02:23.328 16631 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] INFO o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] partition assignment took 28 ms.
current active tasks: [0_3, 1_3, 2_3, 3_3, 4_3, 5_3, 7_2, 8_2, 9_2, 10_2, 12_1, 13_1, 14_1, 15_1, 16_1, 17_1, 19_0, 20_0, 21_0, 22_0, 23_0, 24_0, 25_0, 26_0]
current standby tasks: [0_5]
previous active tasks: []
04:03:47.602 100905 [http-nio-8080-exec-10] INFO c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING
04:03:49.356 102659 [http-nio-8080-exec-2] INFO c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING
04:03:51.600 104903 [http-nio-8080-exec-3] INFO c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING
04:03:53.356 106659 [http-nio-8080-exec-4] INFO c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING
Number of topics - 100
Partitions per topic - 6. (7 topics with 1 partition only)
kubernetes env - 3 pods ( 2 stream threads )
当我们尝试使用以下命令列出消费者组
When we try to list consumer group using following command
root@bastion-0:/app/confluent-5.2.2/bin# ./kafka-consumer-groups --describe --group app --bootstrap-server kafka-0..local:9094 --command-config /app/client-sasl-ssl.properties --members
CONSUMER-ID HOST CLIENT-ID #PARTITIONS
app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-1-consumer-3b370697-e737-411c-af28-fb04cfbae1dd 1.1.1.1/1.1.1.1 app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-1-consumer 45
app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-2-consumer-3edb3e5f-9f1a-499f-8732-6cd2c8b96c96 2.2.2.2/2.2.2.2 app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-2-consumer 45
app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1-consumer-00e24df4-5669-4e2c-a775-8f6c4f689714 3.3.3.3/3.3.3.3 app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1-consumer 46
app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-2-consumer-1b6b2955-5dfd-4be7-8ad9-9f1b54fe6310 1.1.1.1/1.1.1.1 app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-2-consumer 45
app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-1-consumer-72cd0319-8ca7-493c-891d-3022b235ea01 2.2.2.2/2.2.2.2 app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-1-consumer 45
app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2-consumer-c1a16d64-8d49-4758-ab64-2af3cd9aef0f 3.3.3.3/3.3.3.3 app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2-consumer 45
以上命令的输出一直在变化-从0到某个可变数字.理想情况下,它应该在一段时间后变得稳定.
The output of the above command keeps on changing - from 0 to some variable number. Ideally it should become stable after some time.
kafka流平衡(重新平衡)是否有任何可调参数/配置
Are there any tunables/configs for kafka streams balancing (rebalancing)
问题:
-
是什么导致应用程序在启动时无休止地重新平衡(即使没有错误/异常等).
What causes application to rebalance endlessly while starting (even though there are no errors/exception, etc).
是否有任何可调参数可以帮助我们避免重新平衡?
Is there any tunables which can help us avoid rebalancing ?
推荐答案
看看您添加的日志,使用者pod正在启动,因此我想也许其他2个pod正在滚动重启,因此重新平衡每次停下来又开始一次.
Looking at the logs you have added, the consumer pod is starting up and so I guess maybe there is a rolling restart of the other 2 pods and hence a rebalance each time one stops and one starts.
尽管Kafka运行很快,但重新平衡不是很快,因为在此过程中整个组之间都有聊天-尽管可以将分区分配给一个使用方,但只有在所有使用方都已分配了所有使用方后,该组才开始使用,并且发现了分配仅在轮询方法内发生(请参阅 https ://chrisg23.blogspot.com/2020/02/why-is-pausing-kafka-consumer-so.html ).
Although Kafka is fast when running rebalance is not fast as there is chat across the group during the process - although partitions may be assigned to one consumer, the group only starts consuming when all consumers have had their assignment, and the discovery of assignment only happens within the poll method (see https://chrisg23.blogspot.com/2020/02/why-is-pausing-kafka-consumer-so.html).
因此,加快处理速度的方法是更频繁地轮询,以便您可以更快地了解更改,但是这是一个折衷方案-如果在正常运行中主题不忙,那么将会有很多旋转什么也不做.
Hence the way to speed up the process is to poll more frequently so that you get to hear about changes quicker, but there is a trade off - if in normal running the topics are not busy then there will be a lot of spinning doing nothing.
但是,您不确定自己的意思是无休止的.如果您的意思是应用程序实际上只是在重新平衡,那么请参阅上面的评论.可能是Pod连续不断地上下波动(心跳减弱),或者轮询花费了很长时间-您是否为每个记录执行大量I/O?从日志和容器名称中可以很明显地看到重新启动.过多的轮询还会导致警告消息,提示您增加max.poll.interval.ms
或减少max.poll.records
However, you are not quite clear on what you mean by endlessly. If you mean that the application is literally only rebalancing then see my comment above. It may be that pods are going up and down continuously (heartbeats dying) or else polling is taking a long time - are you doing a lot of I/O for each record? Restarts would be obvious from the logs and the pod names. Excessive polling would also cause warning messages suggesting you either increase max.poll.interval.ms
or reduce max.poll.records
这篇关于Kafka Streams应用程序无休止的重新平衡的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!