调整以下参数,可以大幅度改善Redis集群的稳定性:
为何大压力下要这样调整?
最重要的原因之一Redis的主从复制,两者复制共享同一线程,虽然是异步复制的,但因为是单线程,所以也十分有限。如果主从间的网络延迟不是在0.05左右,比如达到0.6,甚至1.2等,那么情况是非常糟糕的,因此同一Redis集群一定要部署在同一机房内。
这些参数的具体值,要视具体的压力而定,而且和消息的大小相关,比如一条200~500KB的流水数据可能比较大,主从复制的压力也会相应增大,而10字节左右的消息,则压力要小一些。大压力环境中开启appendfsync是十分不可取的,容易导致整个集群不可用,在不可用之前的典型表现是QPS毛刺明显。
这么做的目的是让Redis集群尽可能的避免master正常时触发主从切换,特别是容纳的数据量很大时,和大压力结合在一起,集群会雪崩。
当Redis日志中,出现大量如下信息,即可能意味着相关的参数需要调整了:
22135:M 06 Sep 14:17:05.388 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about e438a338e9d9834a6745c12931950da87e360ca2 22135:M 06 Sep 14:17:07.551 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about d6eb06e9d118c120d3961a659972a1d0191a8652 22135:M 06 Sep 14:17:08.438 # Failover auth granted to f7d6b2c72fa3b801e7dcfe0219e73383d143dd0f for epoch 285 (We can vote for this slave) 有投票资格的node: 1)为master 2)至少有一个slot 3)投票node的epoch不能小于node自己当前的epoch(reqEpoch < curEpoch) 4)node没有投票过该epoch(already voted for epoch) 5)投票node不能为master(it is a master node) 6)投票node必须有一个master(I don't know its master) 7)投票node的master处于fail状态(its master is up) 22135:M 06 Sep 14:17:19.844 # Failover auth denied to 534b93af6ba45a7033dbf38c8f47cd688514125a: already voted for epoch 285 如果一个node又联系上了,则它当是一个slave,或者无slots的master时,直接清除FAIL标志;但如果是一个master,则当“(now - node->fail_time) > (server.cluster_node_timeout * CLUSTER_FAIL_UNDO_TIME_MULT)”时,也清除FAIL标志,定义在cluster.h中(cluster.h:#define CLUSTER_FAIL_UNDO_TIME_MULT 2 /* Undo fail if master is back. */) 22135:M 06 Sep 14:17:29.243 * Clear FAIL state for node d6eb06e9d118c120d3961a659972a1d0191a8652: master without slots is reachable again. 如果消息类型为fail。 22135:M 06 Sep 14:17:31.995 * FAIL message received from f7d6b2c72fa3b801e7dcfe0219e73383d143dd0f about 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6 22135:M 06 Sep 14:17:32.496 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about d7942cfe636b25219c6d56aa72828fcfde2ee261 22135:M 06 Sep 14:17:32.968 # Failover auth granted to 938d9ae2de278938beda1d39185608b02d3b31ec for epoch 286 22135:M 06 Sep 14:17:33.177 # Failover auth granted to d9dadf3342006e2c92def3071ca0a76390be62b0 for epoch 287 22135:M 06 Sep 14:17:36.336 * Clear FAIL state for node 1ba437fa1683a8caafd38ff977e5fbabdaf84fd6: master without slots is reachable again. 22135:M 06 Sep 14:17:36.855 * Clear FAIL state for node d7942cfe636b25219c6d56aa72828fcfde2ee261: master without slots is reachable again. 22135:M 06 Sep 14:17:38.419 * Clear FAIL state for node e438a338e9d9834a6745c12931950da87e360ca2: is reachable again and nobody is serving its slots after some time. 22135:M 06 Sep 14:17:54.954 * FAIL message received from ae8f6e7e0ab16b04414c8f3d08b58c0aa268b467 about 7990d146cece7dc83eaf08b3e12cbebb2223f5f8 22135:M 06 Sep 14:17:56.697 * FAIL message received from 1d07e208db56cfd7395950ca66e03589278b8e12 about fbe774cdbd2acd24f9f5ea90d61c607bdf800eb5 22135:M 06 Sep 14:17:57.705 # Failover auth granted to e1c202d89ffe1c61b682e28071627635974c84a7 for epoch 288 22135:M 06 Sep 14:17:57.890 * Clear FAIL state for node 7990d146cece7dc83eaf08b3e12cbebb2223f5f8: slave is reachable again. 22135:M 06 Sep 14:17:57.892 * Clear FAIL state for node fbe774cdbd2acd24f9f5ea90d61c607bdf800eb5: master without slots is reachable again. |