Question
Kafka has the concept of an in-sync replica (ISR) set, which is the set of nodes that aren't too far behind the leader.
What happens if the network cleanly partitions so that a minority containing the leader is on one side, and a majority containing the other in-sync nodes is on the other side?
The minority (leader) side presumably thinks that it lost a bunch of nodes, reduces the ISR size accordingly, and happily carries on.
The other side probably thinks that it lost the leader, so it elects a new one and happily carries on.
Now we have two leaders in the same cluster, accepting writes independently. In a system that requires a majority of nodes to proceed after a partition, the old leader would step down and stop accepting writes.
What happens in this situation in Kafka? Does it require a majority vote to change the ISR set? If so, is there a brief window of data loss until the leader side detects the outage?
Recommended answer
I haven't tested this, but I think the accepted answer is wrong and Lars Francke is correct about the possibility of split-brain.
A ZooKeeper quorum requires a majority, so if the ZK ensemble partitions, at most one side will have a quorum.
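To see why at most one side can hold a quorum, here is a toy check in Python. It is only a sketch: `has_quorum` and `sides_with_quorum` are made-up helper names, not ZooKeeper APIs, and the model assumes a clean two-way split of the ensemble.

```python
def has_quorum(reachable: int, ensemble_size: int) -> bool:
    """A side has a quorum only if it holds a strict majority of the ensemble."""
    return reachable > ensemble_size // 2


def sides_with_quorum(ensemble_size: int, left: int) -> int:
    """Count how many sides of a clean two-way split still have a quorum."""
    right = ensemble_size - left
    return sum(has_quorum(n, ensemble_size) for n in (left, right))


# Try every possible two-way split of a few ensemble sizes:
# no split ever leaves both sides with a quorum.
for size in (3, 4, 5):
    for left in range(size + 1):
        assert sides_with_quorum(size, left) <= 1

print(sides_with_quorum(5, 2))  # 2/3 split of five nodes: prints 1
print(sides_with_quorum(4, 2))  # even 2/2 split of four nodes: prints 0
```

The second case is the usual argument for odd ensemble sizes: an even split leaves neither side with a majority.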
Being a controller requires an active session with ZK (ephemeral znode registration). If the current controller is partitioned away from the ZK quorum, it should voluntarily stop considering itself the controller. This should take at most zookeeper.session.timeout.ms = 6000 ms. Brokers still connected to the ZK quorum should elect a new controller among themselves. (Based on this: https://stackoverflow.com/a/52426734)
Being a topic-partition leader also requires an active session with ZK. A leader that has lost its connection to the ZK quorum should voluntarily stop being one. The elected controller will detect that some ex-leaders are missing and will assign new leaders from the replicas that are in the ISR and still connected to the ZK quorum.
Now, what happens to producer requests received by the partitioned ex-leader during the ZK timeout window? There are a few possibilities.
If the producer's acks = all and the topic's min.insync.replicas = replication.factor, then all ISR members should have exactly the same data. The ex-leader will eventually reject in-progress writes and producers will retry them. The newly elected leader will not have lost any data. On the other hand, it won't be able to serve any write requests until the partition heals, because with min.insync.replicas = replication.factor every replica must be in sync before a write is acknowledged. It will be up to producers to decide whether to reject client requests or keep retrying in the background for a while.
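The availability trade-off in this case can be modelled in a few lines of Python. This is a simplified sketch, not broker code: `leader_accepts_write` is a hypothetical helper, and it only encodes the rule that with acks = all a write is refused once the ISR is smaller than min.insync.replicas.

```python
def leader_accepts_write(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Toy model of the leader-side check for a produce request.

    min.insync.replicas is only enforced for acks=all; with weaker acks
    the leader accepts the write regardless of the ISR size.
    """
    if acks != "all":
        return True
    return isr_size >= min_insync_replicas


REPLICATION_FACTOR = 3

# With min.insync.replicas == replication.factor, losing any single
# replica from the ISR is enough to block acks=all writes.
for isr_size in (3, 2, 1):
    ok = leader_accepts_write("all", isr_size, REPLICATION_FACTOR)
    print(f"ISR size {isr_size}: write accepted = {ok}")
```

Only the full ISR of 3 accepts writes here, which is why this setting trades write availability for not losing acknowledged data.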
Otherwise, it is very probable that the new leader will be missing up to zookeeper.session.timeout.ms + replica.lag.time.max.ms = 16000 ms worth of records, and they will be truncated from the ex-leader after the partition heals.
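The 16000 figure is just the sum of the two timeouts. Here is a quick sketch of the arithmetic, assuming the values implied above (6000 ms for the ZK session timeout, hence 10000 ms for replica.lag.time.max.ms). The produce rate is a made-up number, used only to turn the window into a record count; check your own broker settings, since they may differ.

```python
ZOOKEEPER_SESSION_TIMEOUT_MS = 6_000   # value quoted in the answer
REPLICA_LAG_TIME_MAX_MS = 10_000       # implied by 16000 - 6000


def worst_case_window_ms(session_timeout_ms: int, replica_lag_ms: int) -> int:
    """Upper bound on how long the ex-leader may keep accepting writes alone."""
    return session_timeout_ms + replica_lag_ms


def records_at_risk(window_ms: int, records_per_second: int) -> int:
    """Records written during the window at a steady produce rate."""
    return window_ms * records_per_second // 1000


window = worst_case_window_ms(ZOOKEEPER_SESSION_TIMEOUT_MS, REPLICA_LAG_TIME_MAX_MS)
print(window)                        # 16000
print(records_at_risk(window, 500))  # 8000, at a hypothetical 500 records/s
```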
Let's say you expect network partitions to last longer than you are comfortable with being read-only.
Something like this can work:
- you have 3 availability zones and expect that at most 1 zone will be partitioned from the other 2
- in each zone you have a ZooKeeper node (or a few), so that any 2 zones combined can always form a majority
- in each zone you have a bunch of Kafka brokers
- each topic has replication.factor = 3, with one replica in each availability zone, and min.insync.replicas = 2
- producers use acks = all
This way there should be two Kafka ISR members on the ZK quorum side of the network partition, at least one of them fully up to date with the ex-leader. So there is no data loss on the brokers, and the partition stays available for writes from any producers that are still able to connect to the winning side.
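As a sanity check on the three-zone layout, here is a small Python model. It is a sketch under simplifying assumptions: exactly one ZK node and one replica per zone, every reachable replica counted as in sync, and the zone names and `side_status` helper invented for illustration.

```python
ZONES = ("az-a", "az-b", "az-c")  # hypothetical zone names
MIN_INSYNC_REPLICAS = 2


def side_status(zones_on_side):
    """Evaluate one side of a partition (one ZK node and one replica per zone)."""
    zk_nodes = replicas = len(zones_on_side)
    zk_quorum = zk_nodes > len(ZONES) // 2
    # A side can take acks=all writes only if it can elect a leader
    # (needs the ZK quorum) and still satisfies min.insync.replicas.
    writable = zk_quorum and replicas >= MIN_INSYNC_REPLICAS
    return {"zk_quorum": zk_quorum, "writable": writable}


# Isolate each zone in turn and compare the two sides.
for isolated in ZONES:
    rest = [zone for zone in ZONES if zone != isolated]
    print(isolated, "isolated:", side_status([isolated]), "| rest:", side_status(rest))
```

Whichever zone is cut off, the two remaining zones keep both the ZK quorum and enough replicas for writes, while the isolated zone has neither.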