Setup: a replica set with 5 nodes, version 3.4.5.
I am trying to switch the PRIMARY with rs.stepDown(60, 30), but I always get this error:
rs0:PRIMARY> rs.stepDown(60, 30)
{
"ok" : 0,
"errmsg" : "No electable secondaries caught up as of 2017-07-11T00:21:11.205+0000. Please use {force: true} to force node to step down.",
"code" : 50,
"codeName" : "ExceededTimeLimit"
}
However, rs.printSlaveReplicationInfo(), run in a parallel terminal, confirms that all secondaries are fully caught up:
rs0:PRIMARY> rs.printSlaveReplicationInfo()
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
Am I doing something wrong?
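The error and the printSlaveReplicationInfo() output are not necessarily contradictory. That report shows lag in whole seconds, while an oplog optime is a Timestamp made of a seconds value plus an increment, so a secondary can be "0 secs behind" and still be a few operations short of the primary when writes keep arriving within the same second. The catch-up wait also only counts electable secondaries (not priority 0 members or arbiters). The sketch below is a hypothetical helper, not a MongoDB API, that makes both checks explicit. It assumes you have already joined rs.status() with rs.conf() into one object per member and extracted each optime as a plain { t, i } pair (seconds and increment); the sample values are made up for illustration. It is plain JavaScript and runs under Node.js.

```javascript
// Hypothetical helper (not part of the mongo shell API).
// Each member is assumed to look like:
//   { name, stateStr, priority, hidden?, arbiterOnly?, optime: { t, i } }
// where optime.t is the Timestamp seconds and optime.i its increment.

function compareOptime(a, b) {
  // Order by seconds first, then by the increment within that second.
  if (a.t !== b.t) return a.t < b.t ? -1 : 1;
  if (a.i !== b.i) return a.i < b.i ? -1 : 1;
  return 0;
}

function isElectable(member) {
  // Arbiters, hidden members and priority-0 members can never become primary.
  return !member.arbiterOnly && !member.hidden && member.priority > 0;
}

function caughtUpElectableSecondaries(members) {
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) throw new Error("no PRIMARY in member list");
  return members
    .filter((m) => m.stateStr === "SECONDARY" && isElectable(m))
    .filter((m) => compareOptime(m.optime, primary.optime) >= 0)
    .map((m) => m.name);
}

// Made-up sample: every member is in the same second as the primary, so a
// seconds-only report says "0 secs behind" for all of them. c and e are
// behind by increment, and d has priority 0.
const members = [
  { name: "a:27017", stateStr: "PRIMARY", priority: 1, optime: { t: 1499732471, i: 40 } },
  { name: "b:27017", stateStr: "SECONDARY", priority: 1, optime: { t: 1499732471, i: 40 } },
  { name: "c:27017", stateStr: "SECONDARY", priority: 1, optime: { t: 1499732471, i: 37 } },
  { name: "d:27017", stateStr: "SECONDARY", priority: 0, optime: { t: 1499732471, i: 40 } },
  { name: "e:27017", stateStr: "SECONDARY", priority: 1, optime: { t: 1499732471, i: 39 } },
];

console.log(caughtUpElectableSecondaries(members)); // [ 'b:27017' ]
```

Comparing the increment as well as the seconds is the point of the sketch: at second granularity all four secondaries look identical, yet only one of them is both electable and exactly level with the primary.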
UPD: I checked for long-running operations before and during rs.stepDown as shown below, and this is what it looks like:
# Before rs.stepDown
$ watch "mongo --quiet --eval 'JSON.stringify(db.currentOp())' | jq -r '.inprog[] | \"\(.secs_running) \(.desc) \(.op)\"' | sort -rnk1"
984287 rsSync none
984287 ReplBatcher none
67 WT RecordStoreThread: local.oplog.rs none
null SyncSourceFeedback none
null NoopWriter none
0 conn615153 command
0 conn614948 update
0 conn614748 getmore
...
# During rs.stepDown
984329 rsSync none
984329 ReplBatcher none
108 WT RecordStoreThread: local.oplog.rs none
16 conn615138 command
16 conn615136 command
16 conn615085 update
16 conn615079 insert
...
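Basically, the long-running user operations seem to be a result of rs.stepDown(): secs_running becomes non-zero as soon as the PRIMARY tries to step down and keeps growing until the stepDown fails. Then everything returns to normal. Any ideas on why this happens and whether this is normal?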
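The jq pipeline used in the update can also be expressed in JavaScript, which makes the before/during comparison easier to repeat. The sketch below is a hypothetical helper that takes the inprog array returned by db.currentOp(), drops internal threads (which report op "none" in the listings above), keeps entries that have been running for at least minSecs, and sorts them longest first. The fields secs_running, desc and op are the ones the update already prints; treating a missing secs_running as 0 is an assumption. The sample data is copied from the "During rs.stepDown" listing.

```javascript
// Hypothetical helper: rank client operations from db.currentOp().inprog.
function longRunningUserOps(inprog, minSecs) {
  return inprog
    .filter((op) => op.op !== "none") // internal threads report op "none"
    .map((op) => ({
      // Assumption: a missing/null secs_running counts as 0 seconds.
      secs: op.secs_running == null ? 0 : Number(op.secs_running),
      desc: op.desc,
      op: op.op,
    }))
    .filter((op) => op.secs >= minSecs)
    .sort((a, b) => b.secs - a.secs) // longest-running first
    .map((op) => `${op.secs} ${op.desc} ${op.op}`);
}

// Entries taken from the "During rs.stepDown" listing above.
const during = [
  { secs_running: 984329, desc: "rsSync", op: "none" },
  { secs_running: 984329, desc: "ReplBatcher", op: "none" },
  { secs_running: 108, desc: "WT RecordStoreThread: local.oplog.rs", op: "none" },
  { secs_running: 16, desc: "conn615138", op: "command" },
  { secs_running: 16, desc: "conn615136", op: "command" },
  { secs_running: 16, desc: "conn615085", op: "update" },
  { secs_running: 16, desc: "conn615079", op: "insert" },
];

longRunningUserOps(during, 1).forEach((line) => console.log(line));
// 16 conn615138 command
// 16 conn615136 command
// 16 conn615085 update
// 16 conn615079 insert
```

In the mongo shell the same function could be called as longRunningUserOps(db.currentOp().inprog, 1).forEach((line) => print(line)), since the legacy shell uses print rather than console.log.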
Best answer
I have stepped the primary down to secondary with the following command:
db.adminCommand( { replSetStepDown: 120, secondaryCatchUpPeriodSecs: 15, force: true } )
You can find it in the official MongoDB documentation:
https://docs.mongodb.com/manual/reference/command/replSetStepDown/
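For comparison with the question: rs.stepDown(stepDownSecs, secondaryCatchUpPeriodSecs) is a shell wrapper around the replSetStepDown admin command, so rs.stepDown(60, 30) corresponds to { replSetStepDown: 60, secondaryCatchUpPeriodSecs: 30 }. The answer's version adds force: true, which tells the primary to step down even if no electable secondary catches up within the catch-up period. The documentation linked above also requires the replSetStepDown value to be greater than secondaryCatchUpPeriodSecs. The hypothetical helper below only builds the command document and enforces that rule; it does not contact a server.

```javascript
// Hypothetical helper: build a replSetStepDown command document.
// Validation mirrors the documented rule that the step-down period must be
// greater than the secondary catch-up period.
function buildStepDownCommand(stepDownSecs, catchUpSecs, force = false) {
  if (!(stepDownSecs > catchUpSecs)) {
    throw new RangeError("replSetStepDown must be greater than secondaryCatchUpPeriodSecs");
  }
  const cmd = { replSetStepDown: stepDownSecs, secondaryCatchUpPeriodSecs: catchUpSecs };
  // force: step down even if no electable secondary has caught up in time.
  if (force) cmd.force = true;
  return cmd;
}

// Equivalent of rs.stepDown(60, 30) from the question:
console.log(JSON.stringify(buildStepDownCommand(60, 30)));
// {"replSetStepDown":60,"secondaryCatchUpPeriodSecs":30}

// The answer's forced variant:
console.log(JSON.stringify(buildStepDownCommand(120, 15, true)));
// {"replSetStepDown":120,"secondaryCatchUpPeriodSecs":15,"force":true}
```

In the mongo shell the resulting document would be passed to db.adminCommand(...), exactly as in the answer.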
A similar question about an unsuccessful MongoDB primary stepDown can be found on Stack Overflow: https://stackoverflow.com/questions/45023574/