Problem description
I use MongoDB 3.4.3 and have three machines in one replica set. Let's call them server1, server2 and server3. server2 was stuck in a constant rollback state, so we turned it off. server3 is in the recovering state and tries to fetch the oplog from server1, but every attempt fails with an ExceededTimeLimit exception. Here is an extract from the server3 log:
2017-06-26T14:42:14.442+0300 I REPL [replication-0] could not find member to sync from
2017-06-26T14:42:24.443+0300 I REPL [rsBackgroundSync] sync source candidate: server1:27017
2017-06-26T14:42:24.444+0300 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to server1:27017
2017-06-26T14:42:24.455+0300 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to server1:27017
2017-06-26T14:42:54.459+0300 I REPL [replication-0] Blacklisting server1:27017 due to required optime fetcher error: 'ExceededTimeLimit: Operation timed out, request was RemoteCommand 191739 -- server1:27017 db:local expDate:2017-06-26T14:42:54.459+0300 cmd:{ find: "oplog.rs", oplogReplay: true, filter: { ts: { $gte: Timestamp 1497975676000|310, $lte: Timestamp 1497975676000|310 } } }' for 10s until: 2017-06-26T14:43:04.459+0300. required optime: { ts: Timestamp 1497975676000|310, t: 20 }
These attempts to retrieve the oplog repeat endlessly. According to db.currentOp(), there are a lot of long-running queries on server1 (the primary of the replica set) trying to read the oplog. These queries degrade the performance of server1, so my database is very slow.
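One way to see those oplog readers is to filter the db.currentOp() output on server1. The following is a minimal sketch, not a definitive diagnostic: the helper name findLongOplogOps and the 30-second threshold are assumptions made for illustration, while ns, secs_running, opid and client are fields that db.currentOp() reports for in-progress operations.

```javascript
// Hypothetical helper: select long-running operations on the oplog from the
// `inprog` array returned by db.currentOp(). It is a pure function, so it can
// be tried on sample data without a running mongod.
function findLongOplogOps(inprog, minSecs) {
  return inprog
    .filter(function (op) {
      // Keep only operations on the replica set oplog that have been
      // running for at least `minSecs` seconds.
      return op.ns === "local.oplog.rs" &&
             typeof op.secs_running === "number" &&
             op.secs_running >= minSecs;
    })
    .map(function (op) {
      return { opid: op.opid, secs_running: op.secs_running, client: op.client };
    })
    .sort(function (a, b) {
      // Longest-running first.
      return b.secs_running - a.secs_running;
    });
}

// Assumed usage in the mongo shell connected to server1:
//   printjson(findLongOplogOps(db.currentOp().inprog, 30));
```

The opid values in the result can be passed to db.killOp() to relieve the primary, although a recovering member will probably just issue the same query again on its next sync attempt.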
The oplog on server1 is currently 643 GB. I think its size is the reason why replication doesn't work. server2 had oplog timeout issues as well, so we turned it off temporarily. This situation has lasted for more than a week. I have more than 5 TB of data on the primary machine. How can I restore the replica set?
Update: our servers have 64 GB of memory each. They are indeed virtual machines.
Recommended answer
Can you afford downtime? It looks like your machine (server1) doesn't have enough memory. With 5 TB of data and an oplog that big, the amount of memory needed is in the hundreds of GB. I would not try to run that system as a single replica set. It looks more like a cluster of 3-5 shards (9-15 nodes in total, with a 3-member replica set for every shard). A good rule is to always keep the data per node under 2 TB, and 1 TB is a good starting point if you can achieve that.
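The shard and node counts above follow from simple arithmetic on the data size. As a rough sketch (the function name shardPlan and its parameters are assumptions for illustration, and config servers and mongos routers are not counted):

```javascript
// Hypothetical sizing helper: how many shards are needed to keep the data per
// shard under a given limit, and how many data-bearing nodes that implies.
function shardPlan(totalTB, maxPerShardTB, replicasPerShard) {
  var shards = Math.ceil(totalTB / maxPerShardTB);
  return {
    shards: shards,
    nodes: shards * replicasPerShard,  // data-bearing replica set members only
    dataPerShardTB: totalTB / shards   // assumes an even data distribution
  };
}

// 5 TB with the 2 TB upper limit and 3-member replica sets:
console.log(JSON.stringify(shardPlan(5, 2, 3)));
// 5 TB with the 1 TB starting point:
console.log(JSON.stringify(shardPlan(5, 1, 3)));
```

With the 2 TB limit this gives 3 shards and 9 nodes, and with the 1 TB target it gives 5 shards and 15 nodes, which is the 3-5 shard, 9-15 node range mentioned in the answer.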
If you can afford downtime, you should shrink your oplog to a more reasonable size. You could start with 50 GB. Steps can be found here.