This article describes how to deal with a RabbitMQ cluster that does not reconnect after a network failure. It should provide a useful reference for anyone who runs into the same problem.

Problem Description

I have a RabbitMQ cluster with two nodes in production and the cluster is breaking with these error messages:

=ERROR REPORT==== 23-Dec-2011::04:21:34 ===
** Node rabbit@rabbitmq02 not responding **
** Removing (timedout) connection **

=INFO REPORT==== 23-Dec-2011::04:21:35 ===
node rabbit@rabbitmq02 lost 'rabbit'

=ERROR REPORT==== 23-Dec-2011::04:21:49 ===
Mnesia(rabbit@rabbitmq01): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@rabbitmq02}

I tried to simulate the problem by killing the connection between the two nodes using "tcpkill". The cluster has disconnected, and surprisingly the two nodes are not trying to reconnect!

When the cluster breaks, the HAProxy load balancer still marks both nodes as active and sends requests to both of them, although they are not in a cluster.

My questions:


  1. If the nodes are configured to work as a cluster, when I get a network failure, why aren't they trying to reconnect afterwards?

  2. How can I identify a broken cluster and shut down one of the nodes? I have consistency problems when working with the two nodes separately.


Recommended Answer

One other way to recover from this kind of failure is to work with Mnesia, the database that RabbitMQ uses as its persistence mechanism and through which the synchronization of the RabbitMQ instances (and their master/slave status) is controlled. For all the details, refer to the following URL: http://www.erlang.org/doc/apps/mnesia/Mnesia_chap7.html

Adding the relevant section here:

One is when Mnesia already is up and running and the Erlang nodes gain contact again. Then Mnesia will try to contact Mnesia on the other node to see if it also thinks that the network has been partitioned for a while. If Mnesia on both nodes has logged mnesia_down entries from each other, Mnesia generates a system event, called {inconsistent_database, running_partitioned_network, Node} which is sent to Mnesia's event handler and other possible subscribers. The default event handler reports an error to the error logger.
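As an illustration (not part of the original answer), a process on one of the RabbitMQ nodes can subscribe to Mnesia's system events and then receives this partition notification as an ordinary Erlang message. A minimal sketch, run from an Erlang shell attached to the broker node (the node names used elsewhere in this post are placeholders):

mnesia:subscribe(system).
%% The calling process now receives Mnesia system events, for example:
receive
    {mnesia_system_event, {inconsistent_database, Context, Node}} ->
        %% Context is running_partitioned_network or starting_partitioned_network
        error_logger:error_msg("Partition detected towards ~p (~p)~n", [Node, Context])
after 5000 ->
    no_partition_event_yet
end.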

Another occasion when Mnesia may detect that the network has been partitioned due to a communication failure is at start-up. If Mnesia detects that both the local node and another node received mnesia_down from each other it generates a {inconsistent_database, starting_partitioned_network, Node} system event and acts as described above.

If the application detects that there has been a communication failure which may have caused an inconsistent database, it may use the function mnesia:set_master_nodes(Tab, Nodes) to pinpoint from which nodes each table may be loaded.
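For example (a sketch only; rabbit_user is simply one of RabbitMQ's Mnesia tables, and the node name matches the example in the question), the surviving node can be pinned as the node that table must be loaded from:

mnesia:set_master_nodes(rabbit_user, ['rabbit@rabbitmq01']).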

At start-up Mnesia's normal table load algorithm will be bypassed and the table will be loaded from one of the master nodes defined for the table, regardless of potential mnesia_down entries in the log. The Nodes may only contain nodes where the table has a replica and if it is empty, the master node recovery mechanism for the particular table will be reset and the normal load mechanism will be used when next restarting.

The function mnesia:set_master_nodes(Nodes) sets master nodes for all tables. For each table it will determine its replica nodes and invoke mnesia:set_master_nodes(Tab, TabNodes) with those replica nodes that are included in the Nodes list (i.e. TabNodes is the intersection of Nodes and the replica nodes of the table). If the intersection is empty the master node recovery mechanism for the particular table will be reset and the normal load mechanism will be used at next restart.
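Applied to the two-node cluster from the question, a hedged sketch would be to decide that rabbit@rabbitmq01 is the node whose data should win and then run the following on each node before restarting it, so that every table is loaded from that node:

%% Run on rabbit@rabbitmq01 and on rabbit@rabbitmq02 (assumption: rabbitmq01 is the chosen winner):
mnesia:set_master_nodes(['rabbit@rabbitmq01']).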

The functions mnesia:system_info(master_node_tables) and mnesia:table_info(Tab, master_nodes) may be used to obtain information about the potential master nodes.
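For instance (rabbit_user again being only an example table name):

mnesia:system_info(master_node_tables).        %% tables that currently have master nodes set
mnesia:table_info(rabbit_user, master_nodes).  %% master nodes configured for that table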

Determining which data to keep after communication failure is outside the scope of Mnesia. One approach would be to determine which "island" contains a majority of the nodes. Using the {majority,true} option for critical tables can be a way of ensuring that nodes that are not part of a "majority island" are not able to update those tables. Note that this constitutes a reduction in service on the minority nodes. This would be a tradeoff in favour of higher consistency guarantees.
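A sketch of what such a declaration looks like (the table name and attributes are invented for illustration; RabbitMQ creates its own tables, so this only demonstrates the Mnesia option itself):

mnesia:create_table(critical_tab,
    [{disc_copies, ['rabbit@rabbitmq01', 'rabbit@rabbitmq02']},
     {attributes, [key, value]},
     {majority, true}]).

With {majority, true}, a write transaction on the table can only commit while a majority of its replicas are reachable, which is what prevents the minority side of a partition from updating it.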

The function mnesia:force_load_table(Tab) may be used to force load the table regardless of which table load mechanism is activated.
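For example (again with a placeholder table name):

mnesia:force_load_table(rabbit_user).  %% returns yes on success, otherwise an error description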

This is a more lengthy and complicated way to recover from such failures, but it offers better granularity and control over the data that should end up in the final master node (which can reduce the amount of data loss that may occur when merging RabbitMQ master nodes).

This concludes the article on a RabbitMQ cluster not reconnecting after a network failure. We hope the recommended answer is helpful.
