环境:
Oracle 10gAIX 6.1
HACMP 6
现象:
1.数据库Down。
2.alert日志中频繁报ORA-04031: unable to allocate 4120 bytes of shared memory("shared pool",...)错误。
3.数据库每晚的逻辑导出报错:
TESTDB:/expbackup#>tail -200 backehr140322.log
EXP-00056: ORACLE error 1034 encountered
ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
IBM AIX RISC System/6000 Error: 2: No such file ordirectory
EXP-00005: all allowable logon attempts failed
EXP-00000: Export terminated unsuccessfully
即使是手动发起备份脚本,也是报同样的错误,备份无法进行,但是每晚的RMAN定时备份能正常进行,每天的备份集状态正常。
分析:
Oracle的解释:
[oracle@TESTDB ~]$ oerr ORA 4031
04031, 00000, "unable to allocate %s bytes of shared memory (\"%s\",\"%s\",\"%s\",\"%s\")"
// *Cause: More shared memory is needed than was allocated in the shared
// pool or Streams pool.
// *Action: If the shared pool is out of memory, either use the
// DBMS_SHARED_POOL package to pin large packages,
// reduce your use of shared memory, or increase the amount of
// available shared memory by increasing the value of the
// initialization parameters SHARED_POOL_RESERVED_SIZE and
// SHARED_POOL_SIZE.
// If the large pool is out of memory, increase the initialization
// parameter LARGE_POOL_SIZE.
// If the error is issued from an Oracle Streams or XStream process,
// increase the initialization parameter STREAMS_POOL_SIZE or increase
// the capture or apply parameter MAX_SGA_SIZE.
查阅了一些Oracle的资料,发现ORA-04031错误一般可能由于两个原因引起的:
1.内存中存在大量碎片,导致在分配内存的时候,没有连续的内存可存放,此问题一般是需要在开发的角度上入手,比如增加绑定变量,减少应解析来改善和避免;
2.内存容量不足,需要扩大内存。
Alert Log在数据库Down之前的报了很多ORA-04031错误:
Sat Mar 22 19:00:09 2014
Errors in file /oracle/admin/oratest/bdump/oratest_j001_233886.trc:
ORA-12012: error on auto execute of job 8950
ORA-04031: unable to allocate ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select name,password,datats#...","Typecheck","kgghteInit")
ORA-06512: at "SYS.DBMS_SPACE", line 3289
ORA-06512: at line 1
ORA-00604: error occurred at recursive SQL level 5
ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select name,password,datats#...","Typecheck","kgghteInit")
ORA-06512: at "SYS.PRVT_ADVISOR", line 1624
ORA-06512: at "SYS.DBMS_ADVISOR", line 186
ORA-06512: at "SYS.DBMS_SPACE", line 1500
ORA-06512: at "SYS.DBMS_SPACE", line 1566
bytes of shared memory ("","","","")
Sat Mar 22 20:28:28 2014
Errors in file /oracle/admin/oratest/bdump/orahtest_j000_278630.trc:
ORA-12012: error on auto execute of job 42781
ORA-04031: unable to allocate ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select x.inst_id,x.indx+1,ks...","Typecheck","kgghtInit")
ORA-06512: at "SYS.DBMS_ISCHED", line 3047
ORA-06512: at "SYS.DBMS_SCHEDULER", line 1926
ORA-06512: at "EXFSYS.DBMS_RLMGR_DR", line 3150
ORA-01403: no data found
ORA-06512: at line 1
bytes of shared memory ("","","","")
Sat Mar 22 20:55:46 2014
Errors in file /oracle/admin/oratest/bdump/oratest_reco_319580.trc:
ORA-04031: unable to allocate 4120 bytes of shared memory ("shared pool","select host,userid,password,...","Typecheck","kgghteInit")
Sat Mar 22 20:55:46 2014
RECO: terminating instance due to error 4031
Instance terminated by RECO, pid = 319580
其中重点查看导致Instance Down掉的这条日志,是由于进程号为319580的RECO进程导致的,提示是无法在shared pool中分配4120bytes的内存块,具体的信息需要分析一下数据库的跟踪文件:orahr_reco_319580.trc,在这里,不要用tkprof之类的工具进行处理,直接用文本编辑器打开分析,其中比较值得注意的地方有:
==============================
Memory Utilization of Subpool 1
================================
Allocation Name Size
_________________________ __________
"free memory " 4306672
"miscellaneous " 304
"Undo Meta Data " 144
"dpslut_kfdsg " 512
"File Space Usage " 11336
"trace buffer " 950272
"trace_knlasg " 504
由此看到,虽然free memory并不大,但是其确实大于报错中提到的4120bytes,由此看来,有可能引起该报错的原因有两个,就是上文中提到的:
1.确实有碎片;2.内存可能不足。
在取数据分析的时候,正好是上班时间,因此只能通过一些临时手段进行该问题的处理,比较有效的方法是:
刷共享池:altersystem flush shared_pool;
作此操作一定要注意!千万不要在业务繁忙的时候做,很有可能引起宕机!可以在中午或者是下班时刻执行,执行之前,如果有条件,最好做一次全备,不论是逻辑导出还是RMAN备份。
内存不足这个原因需要做进一步的分析,用数据说话,不能单凭感觉和经验。
跟客户和相关负责人了解了一下情况,得知客户在不久前刚进行过一次数据库的迁移,我感觉可能是由于在迁移过程中有地方疏忽造成的隐患。
由于之前帮客户处理数据库问题的时候,抓过数据库的各种信息,所以手头上有当时数据库的详细状态记录;经查看parameter后,发现,有很多参数跟迁移前不同,其中sga_target和sga_max_size这两个值比以前小几十倍。
因此初步断定,当前的内存确实可能较之前比非常不足,跟客户反映后,客户同意无业务时间进行参数修改和重启数据库的工作。
于是在下班后,将数据库参数修改,命令如下:
alter system setsga_max_size=5000M scope=spfile;
alter system setsga_target=5000M scope=spfile;
重启数据库后,手动发起逻辑备份脚本,发现执行成功!
待做完全备以后,将数据库切换到HACMP的另一节点,修改参数,重启,导出均成功后,确定操作有效,判断正确。
至此为止基本的处理已经告一段落,但是问题的解决要除根!具体是因为哪个环节导致的这个隐患的存在一定要找出来,否则每一次迁移都要承受数据库宕机之苦。
经调查,发现当初迁移的时候,是将参数文件通过create pfile from spfile生成的,然后拷贝到另一台机器上进行的startup from pfile='***'。
后来打开这个pfile一看,里面只有短短的10几行,但是在数据库迁移前后我发现有20多个参数是有区别的,因此判断是在恢复操作中很多没有明文声明的参数可能是按照默认的值设定的,但是这些参数在迁移之前可能是有人为了适应应用环境手动调整过的。
因此,恢复参数文件的方法是存在问题的,单纯的移动到目的地,然后通过startup pfile='***'看来是会导致很多参数无法复原到之前的状态。
在此,我建议参数文件用Oracle RMAN去恢复,因此在备份的时候,一定要注意备份相关的控制文件和参数文件,备份相关的操作演示请参阅:RMAN备份、恢复实验室 之 备份篇 【备份实验】
恢复演示如下:
1.启动实例并恢复参数文件:
RMAN> startup nomount;
startup failed: ORA-01078: failure in processing system parameters
LRM-00109: could not open parameter file '/oracle/app/oracle/product/11.2.0/db_1/dbs/initorcl.ora'
starting Oracle instance without parameter file for retrieval of spfile
Oracle instance started
Total System Global Area 158662656 bytes
Fixed Size 2211448 bytes
Variable Size 92275080 bytes
Database Buffers 58720256 bytes
Redo Buffers 5455872 bytes
RMAN> restore spfile from '/oracle/app/oracle/product/11.2.0/db_1/dbs/c-1347976258-20140318-01';
Starting restore at 18-MAR-14
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=96 device type=DISK
channel ORA_DISK_1: restoring spfile from AUTOBACKUP /oracle/app/oracle/product/11.2.0/db_1/dbs/c-1347976258-20140318-01
channel ORA_DISK_1: SPFILE restore from AUTOBACKUP complete
Finished restore at 18-MAR-14
2.设置DBID:
RMAN> set dbid=1347976258
executing command: SET DBID
3.恢复控制文件:
RMAN> restore controlfile from '/oracle/app/oracle/product/11.2.0/db_1/dbs/c-1347976258-20140318-01';
Starting restore at 18-MAR-14
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=125 device type=DISK
channel ORA_DISK_1: restoring control file
channel ORA_DISK_1: restore complete, elapsed time: 00:00:08
output file name=/oracle/app/oradata/ORCL/controlfile/o1_mf_8xm4g27m_.ctl
Finished restore at 18-MAR-14
4.将数据库置于MOUNT状态:
RMAN> alter database mount;
database mounted
released channel: ORA_DISK_1
5.查看备份集信息:
由于控制文件中有备份信息,因此恢复控制文件后便可看到具体的备份信息:
List of Backups
===============
Key TY LV S Device Type Completion Time #Pieces #Copies Compressed Tag
------- -- -- - ----------- --------------- ------- ------- ---------- ---
37 B 0 A DISK 02-AUG-13 1 1 NO TAG20130802T054625
38 B 0 A DISK 02-AUG-13 1 1 NO TAG20130802T054625
39 B 1 A DISK 02-AUG-13 1 1 NO TAG20130802T055026
40 B 1 A DISK 02-AUG-13 1 1 NO TAG20130802T055026
41 B 1 A DISK 18-OCT-13 1 1 NO TAG20131018T005620
42 B 1 A DISK 18-OCT-13 1 1 NO TAG20131018T005620
43 B A A DISK 19-NOV-13 1 1 NO TAG20131119T013353
44 B 0 A DISK 19-NOV-13 1 1 NO TAG20131119T013521
45 B 0 A DISK 19-NOV-13 1 1 NO TAG20131119T013521
46 B A A DISK 19-NOV-13 1 1 NO TAG20131119T013801
47 B A A DISK 07-FEB-14 1 1 NO TAG20140207T002846
48 B 1 A DISK 07-FEB-14 1 1 NO TAG20140207T003003
49 B A A DISK 07-FEB-14 1 1 NO TAG20140207T003109
50 B F A DISK 07-FEB-14 1 1 NO TAG20140207T003110
51 B A A DISK 07-FEB-14 1 1 NO TAG20140207T011201
52 B 1 A DISK 07-FEB-14 1 1 NO TAG20140207T011203
53 B A A DISK 07-FEB-14 1 1 NO TAG20140207T011249
54 B F A DISK 07-FEB-14 1 1 NO TAG20140207T011250
55 B A A DISK 07-FEB-14 1 1 NO TAG20140207T013906
56 B 0 A DISK 07-FEB-14 1 1 NO TAG20140207T013908
57 B A A DISK 07-FEB-14 1 1 NO TAG20140207T014045
58 B F A DISK 07-FEB-14 1 1 NO TAG20140207T014046
59 B 1 A DISK 10-MAR-14 1 1 NO ZHANGZICHAO_STUDY
60 B F A DISK 10-MAR-14 1 1 NO TAG20140310T035229
61 B A A DISK 18-MAR-14 1 1 NO TAG20140318T032915
62 B 1 A DISK 18-MAR-14 1 1 NO TAG20140318T033619
63 B A A DISK 18-MAR-14 1 1 NO TAG20140318T034112
64 B F A DISK 18-MAR-14 1 1 NO TAG20140318T034114
65 B F A DISK 18-MAR-14 1 1 NO TAG20140318T034213
具体查看备份集的方法有很多,请参阅: RMAN备份、恢复实验室 之 备份篇 【查看备份情况】
确认备份集都在之后,进行数据库恢复即可!
6. Restore Database:
RMAN> restore database;
Starting restore at 18-MAR-14
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=133 device type=DISK
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00001 to /oracle/app/oradata/ORCL/datafile/o1_mf_system_9h7jv8g1_.dbf
channel ORA_DISK_1: restoring datafile 00002 to /oracle/app/oradata/ORCL/datafile/o1_mf_sysaux_8xm483f7_.dbf
channel ORA_DISK_1: restoring datafile 00003 to /oracle/app/oradata/ORCL/datafile/o1_mf_undotbs1_8xm483gd_.dbf
channel ORA_DISK_1: restoring datafile 00004 to /oracle/app/oradata/ORCL/datafile/o1_mf_users_9ksobds9_.dbf
channel ORA_DISK_1: restoring datafile 00005 to /oracle/app/oradata/ORCL/datafile/o1_mf_example_9h7hct57_.dbf
channel ORA_DISK_1: restoring datafile 00006 to /oracle/app/oradata/catalog_rman/rmantbs01.dbf
channel ORA_DISK_1: reading from backup piece /oracle/app/oracle/product/11.2.0/db_1/dbs/1rp002ls_1_1
channel ORA_DISK_1: piece handle=/oracle/app/oracle/product/11.2.0/db_1/dbs/1rp002ls_1_1 tag=TAG20140207T013908
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:06:39
Finished restore at 18-MAR-14
RMAN> recover database;
Starting recover at 18-MAR-14
using channel ORA_DISK_1
channel ORA_DISK_1: starting incremental datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
destination for restore of datafile 00002: /oracle/app/oradata/ORCL/datafile/o1_mf_sysaux_9lgs3p0k_.dbf
destination for restore of datafile 00003: /oracle/app/oradata/ORCL/datafile/o1_mf_undotbs1_9lgs3pmp_.dbf
destination for restore of datafile 00004: /oracle/app/oradata/ORCL/datafile/o1_mf_users_9lgs3q3d_.dbf
destination for restore of datafile 00005: /oracle/app/oradata/ORCL/datafile/o1_mf_example_9lgs3p6t_.dbf
destination for restore of datafile 00006: /oracle/app/oradata/catalog_rman/rmantbs01.dbf
channel ORA_DISK_1: reading from backup piece /oracle/app/oracle/product/11.2.0/db_1/dbs/1up2pv13_1_1
channel ORA_DISK_1: piece handle=/oracle/app/oracle/product/11.2.0/db_1/dbs/1up2pv13_1_1 tag=ZHANGZICHAO_STUDY
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:11
channel ORA_DISK_1: starting incremental datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
destination for restore of datafile 00001: /oracle/app/oradata/ORCL/datafile/o1_mf_system_9lgs3ov6_.dbf
destination for restore of datafile 00002: /oracle/app/oradata/ORCL/datafile/o1_mf_sysaux_9lgs3p0k_.dbf
destination for restore of datafile 00003: /oracle/app/oradata/ORCL/datafile/o1_mf_undotbs1_9lgs3pmp_.dbf
destination for restore of datafile 00004: /oracle/app/oradata/ORCL/datafile/o1_mf_users_9lgs3q3d_.dbf
destination for restore of datafile 00005: /oracle/app/oradata/ORCL/datafile/o1_mf_example_9lgs3p6t_.dbf
destination for restore of datafile 00006: /oracle/app/oradata/catalog_rman/rmantbs01.dbf
channel ORA_DISK_1: reading from backup piece /oracle/app/oracle/product/11.2.0/db_1/dbs/21p3f19n_1_1
channel ORA_DISK_1: piece handle=/oracle/app/oracle/product/11.2.0/db_1/dbs/21p3f19n_1_1 tag=TAG20140318T033619
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
starting media recovery
archived log for thread 1 with sequence 134 is already on disk as file /oracle/app/oradata/ORCL/onlinelog/o1_mf_2_8xm4g6s5_.log
archived log for thread 1 with sequence 135 is already on disk as file /oracle/app/oradata/ORCL/onlinelog/o1_mf_3_8xm4g8gt_.log
archived log file name=/oracle/app/oradata/ORCL/onlinelog/o1_mf_2_8xm4g6s5_.log thread=1 sequence=134
archived log file name=/oracle/app/oradata/ORCL/onlinelog/o1_mf_3_8xm4g8gt_.log thread=1 sequence=135
media recovery complete, elapsed time: 00:00:03
Finished recover at 18-MAR-14
8.开启数据库:
RMAN> alter database open resetlogs;
database opened