Preface

1. Fault symptoms

[Screenshot not preserved. Image caption: "Breaking: at 4 a.m., the China-made database cluster of a large manufacturer failed..."]

2. Fault handling

2.1 Single-node startup


-- Log in to the database; business services are back to normal
[omm@node1 ~]$ gsql -d postgres


2.2 Full database backup

-- Query the size of every database in the instance
SELECT d.datname as "Name",
       pg_catalog.pg_get_userbyid(d.datdba) as "Owner",
       pg_catalog.pg_encoding_to_char(d.encoding) as "Encoding",
       d.datcollate as "Collate",
       d.datctype as "Ctype",
       d.datacl AS "Access privileges",
       --pg_catalog.array_to_string(d.datacl, E'\n') AS "Access privileges",
       CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
            THEN pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(d.datname))
            ELSE 'No Access'
       END as "Size",
       t.spcname as "Tablespace",
       pg_catalog.shobj_description(d.oid, 'pg_database') as "Description"
FROM pg_catalog.pg_database d
  JOIN pg_catalog.pg_tablespace t on d.dattablespace = t.oid
-- where d.datname = 'database_name'
ORDER BY 1;

-- Full backup of all databases
gs_dumpall -f /home/omm/bkpall_20240607.sql -p 5432
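The backup step above can be wrapped so that a failed or empty dump is noticed immediately, which matters when the backup is taken on the only surviving node. The following is a minimal sketch, not the author's procedure: the function names `backup_file` and `run_full_backup` are hypothetical, and only the `gs_dumpall -f ... -p ...` invocation comes from the original command.

```shell
# Build the dump file path; the date stamp defaults to today (YYYYMMDD).
backup_file() {
    printf '%s/bkpall_%s.sql\n' "$1" "${2:-$(date +%Y%m%d)}"
}

# Run gs_dumpall and fail loudly if it errors or leaves an empty file.
# Arguments: backup directory, database port, optional date stamp.
run_full_backup() {
    bk_file=$(backup_file "$1" "${3:-}")
    if ! gs_dumpall -f "$bk_file" -p "$2"; then
        echo "gs_dumpall failed" >&2
        return 1
    fi
    if [ ! -s "$bk_file" ]; then
        echo "backup file is missing or empty: $bk_file" >&2
        return 1
    fi
    echo "backup written to $bk_file"
}

# Usage (as the omm user): run_full_backup /home/omm 5432
```

Checking that the file is non-empty is only a sanity check; it does not prove the dump is restorable.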


3. Rebuilding the standby

Primary:

gs_guc set -D /u01/opengauss/data/db -c "replconninfo1='localhost=primary_ip localport=port+1 localheartbeatport=port+4 localservice=port+5 remotehost=standby_ip remoteport=port+1 remoteheartbeatport=port+4 remoteservice=port+5'"
gs_guc set -D /u01/opengauss/data/db -c 'remote_read_mode=off'
gs_guc set -D /u01/opengauss/data/db -c 'replication_type=1'
gs_guc set -D /u01/opengauss/data/db -h "host all omm primary_ip/32 trust"
gs_guc set -D /u01/opengauss/data/db -h "host all omm standby_ip/32 trust"
gs_guc set -D /u01/opengauss/data/db -c "port=primary_port"
gs_guc set -D /u01/opengauss/data/db -c "listen_addresses='primary_ip'"


Standby:
gs_guc set -D /u01/opengauss/data/db -c "replconninfo1='localhost=standby_ip localport=port+1 localheartbeatport=port+4 localservice=port+5 remotehost=primary_ip remoteport=port+1 remoteheartbeatport=port+4 remoteservice=port+5'"
gs_guc set -D /u01/opengauss/data/db -c 'remote_read_mode=off'
gs_guc set -D /u01/opengauss/data/db -c 'replication_type=1'
gs_guc set -D /u01/opengauss/data/db -h "host all omm primary_ip/32 trust"
gs_guc set -D /u01/opengauss/data/db -h "host all omm standby_ip/32 trust"
gs_guc set -D /u01/opengauss/data/db -c "port=standby_port"
gs_guc set -D /u01/opengauss/data/db -c "listen_addresses='standby_ip'"
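The replconninfo1 value in the commands above is the easiest part to get wrong by hand, because each node's string mirrors the other's. Here is a small sketch that derives it from the IPs and base ports, using the same +1, +4 and +5 offsets as the commands above. The function name `build_replconninfo` is hypothetical, and the assumption that the remote ports are derived from the peer's own base port applies only if the two nodes use different ports.

```shell
# Print a replconninfo1 value for one node.
# Arguments: local IP, local base port, remote IP, remote base port.
# Offsets (+1, +4, +5) follow the gs_guc commands in this section.
build_replconninfo() {
    rc_lip=$1
    rc_lport=$2
    rc_rip=$3
    rc_rport=$4
    printf 'localhost=%s localport=%d localheartbeatport=%d localservice=%d ' \
        "$rc_lip" "$((rc_lport + 1))" "$((rc_lport + 4))" "$((rc_lport + 5))"
    printf 'remotehost=%s remoteport=%d remoteheartbeatport=%d remoteservice=%d\n' \
        "$rc_rip" "$((rc_rport + 1))" "$((rc_rport + 4))" "$((rc_rport + 5))"
}

# Usage on the primary (IPs and port are placeholders):
#   gs_guc set -D /u01/opengauss/data/db \
#       -c "replconninfo1='$(build_replconninfo 192.0.2.10 5432 192.0.2.11 5432)'"
# On the standby, swap the two IP/port pairs.
```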

Start the primary:
gs_ctl start -D /u01/opengauss/data/db -M primary

Start the standby, then run a full rebuild from the primary:
gs_ctl start -D /u01/opengauss/data/db -M standby
gs_ctl build -D /u01/opengauss/data/db -M standby -b full
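After the rebuild finishes, it is worth confirming the role and state of each node before declaring the cluster healthy. The sketch below filters the output of `gs_ctl query`. It assumes that output contains lines of the form `local_role : Standby` and `db_state : Normal`; verify this against your openGauss version. The function name `check_ha_state` is hypothetical.

```shell
# Read `gs_ctl query` output on stdin and succeed only if the node reports
# the expected role and a Normal db_state.
# Assumption: the output has "local_role : <role>" and "db_state : <state>" lines.
check_ha_state() {
    ha_want=$1
    ha_out=$(cat)
    # Take the first matching line of each field and strip whitespace.
    ha_role=$(printf '%s\n' "$ha_out" |
        awk -F: '/local_role/ { gsub(/[ \t]/, "", $2); print $2; exit }')
    ha_state=$(printf '%s\n' "$ha_out" |
        awk -F: '/db_state/ { gsub(/[ \t]/, "", $2); print $2; exit }')
    echo "role=$ha_role state=$ha_state"
    [ "$ha_role" = "$ha_want" ] && [ "$ha_state" = "Normal" ]
}

# Usage on the standby:
#   gs_ctl query -D /u01/opengauss/data/db | check_ha_state Standby
```

A non-zero exit status means either the role is not the expected one or the state is something other than Normal, in which case the full `gs_ctl query` output should be read directly.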

4. The cm_ctl cluster tool


5. Summary

06-08 22:52