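This is my first log entry since I started working in operations. I normally keep this kind of thing in my private notes, but from now on I'll try writing it up here, since organizing it properly should make it more useful.

I'm setting myself a small goal: two updates a week from now on~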


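My environment is Red Hat 7.2 with Oracle RAC 11.2.0.4. The system had already been running for some time when, on logging in today, I happened to notice that the instance on node 2 was down, even though all the CRS services looked perfectly normal.

So I checked the alert log on node 2: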

vi /u01/app/oracle/diag/rdbms/rac112/rac1122/trace/alert_rac1122.log

Mon Jul 13 11:05:48 2020
Errors in file /u01/app/oracle/diag/rdbms/rac112/rac1122/trace/rac1122_dbw4_28544.trc:
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1
Mon Jul 13 11:05:48 2020
Errors in file /u01/app/oracle/diag/rdbms/rac112/rac1122/trace/rac1122_o000_295057.trc:
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1
DBW4 (ospid: 28544): terminating the instance due to error 27157
Mon Jul 13 11:05:48 2020
Errors in file /u01/app/oracle/diag/rdbms/rac112/rac1122/trace/rac1122_j001_295484.trc:
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1
Mon Jul 13 11:05:48 2020
System state dump requested by (instance=2, osid=28544 (DBW4)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/rac112/rac1122/trace/rac1122_diag_28495_20200713110548.trc
Dumping diagnostic data in directory=[cdmp_20200713110548], requested by (instance=2, osid=28544 (DBW4)), summary=[abnormal instance termination].
Instance terminated by DBW4, pid = 28544
Errors in file /u01/app/oracle/diag/rdbms/rac112/rac1122/trace/rac1122_dbw4_28544.trc:
ORA-27300: OS system dependent operation:semctl failed with status: 22
ORA-27301: OS failure message: Invalid argument
ORA-27302: failure occurred at: sskgpwrm1
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1
Mon Jul 13 11:05:59 2020
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED

Total Shared Global Region in Large Pages = 0 KB (0%)

Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB

RECOMMENDATION:
  Total System Global Area size is 450 GB. For optimal performance,
  prior to the next instance restart:
  1. Increase the number of unused large pages by
 at least 230401 (page size 2048 KB, total size 450 GB) system wide to
    get 100% of the System Global Area allocated with large pages
********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 96
Number of processor cores in the system is 48
Number of processor sockets in the system is 4
Private Interface 'eno2:1' configured from GPnP for use as a private interconnect.
  [name='eno2:1', type=1, ip=xx.xx.xx.155, mac=xxxxxxxx, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'eno1' configured from GPnP for use as a public interface.
  [name='eno1', type=1, ip=xx.xx.xx.122, mac=70-57-bf-39-1c-25, net=xx.xx.xx.0/24, mask=255.255.255.0, use=public/1]
Public Interface 'eno1:1' configured from GPnP for use as a public interface.
  [name='eno1:1', type=1, ip=xx.xx.xx.124, mac=70-57-bf-39-1c-25, net=xx.xx.xx.0/24, mask=255.255.255.0, use=public/1]
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
    NUMA status: NUMA system w/ 4 process groups
    cellaffinity.ora status: cannot find affinity map at '/etc/oracle/cell/network-config/cellaffinity.ora' (see trace file for details)
CELL communication will use 1 IP group(s):
    Grp 0:
Picked latch-free SCN scheme 3
Mon Jul 13 11:06:10 2020
WARNING: db_recovery_file_dest is same as db_create_file_dest
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
NUMA system with 4 nodes detected
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /u01/app/oracle/product/11.2.0/db_1
System name: Linux
Node name: rac2
Release: 3.10.0-327.el7.x86_64
Version: #1 SMP Thu Oct 29 17:29:29 EDT 2015
Machine: x86_64
Using parameter settings in server-side pfile /u01/app/oracle/product/11.2.0/db_1/dbs/initrac1122.ora
System parameters with non-default values:
  processes = 8192
  sessions = 12384
  spfile = "+DATA/rac112/spfilerac112.ora"
  nls_language = "AMERICAN"
  nls_territory = "CHINA"
  sga_target = 450G
  control_files = "+DATA/rac112/controlfile/current.261.1044461323"
  control_files = "+DATA/rac112/controlfile/current.260.1044461323"
  db_block_size = 8192
  compatible = "11.2.0.4.0"
  log_archive_dest_1 = "location=+DATA/RAC112/DBFRA"
  cluster_database = TRUE
  db_create_file_dest = "+DATA"
  db_recovery_file_dest = "+DATA"
  db_recovery_file_dest_size= 440700M
  thread = 2
  undo_tablespace = "UNDOTBS2"
  instance_number = 2
  remote_login_passwordfile= "EXCLUSIVE"
  db_domain = ""
  dispatchers = "(PROTOCOL=TCP) (SERVICE=rac112XDB)"
  remote_listener = "rac-scan:1521"
  audit_file_dest = "/u01/app/oracle/admin/rac112/adump"
  audit_trail = "DB"
  db_name = "rac112"
  open_cursors = 300
  pga_aggregate_target = 115200M
  diagnostic_dest = "/u01/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
  xx.xx.xx.155
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Mon Jul 13 11:06:12 2020
PMON started with pid=2, OS id=295770
Error occured while spawning process PMON; error = 27153
USER (ospid: 295705): terminating the instance due to error 27153
Instance terminated by USER, pid = 295705
Looking up the error codes, this appears to be a problem at the operating-system level, most likely with kernel or resource parameters:

[oracle@rac2 trace]$ oerr ora 27157
27157, 0000, "OS post/wait facility removed"
// *Cause: the post/wait facility for which the calling process is awaiting
// action is removed from the system
// *Action: check errno and contact Oracle Support
[oracle@rac2 trace]$ oerr ora 27300
27300, 00000, "OS system dependent operation:%s failed with status: %s"
// *Cause: OS system call error
// *Action: contact Oracle Support
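Since the oerr text says to "check errno", the status values in the ORA-27300 lines can be looked up as ordinary OS error numbers. Below is a minimal sketch using only the Python standard library; `describe_errno` is a helper name made up for this illustration. errno numbering is platform-specific, so it should be run on the database host itself (Linux here), where the result can be compared with the "Identifier removed" and "Invalid argument" messages that the alert log already reports for status 43 and 22.

```python
import errno
import os
from typing import Tuple


def describe_errno(status: int) -> Tuple[str, str]:
    """Return (symbolic name, message) for an errno value on this platform.

    Hypothetical helper for illustration; it only wraps the standard
    errno.errorcode table and os.strerror().
    """
    name = errno.errorcode.get(status, "UNKNOWN")
    return name, os.strerror(status)


# Status values taken from the ORA-27300 lines in the alert log above.
for status in (43, 22):
    name, message = describe_errno(status)
    print(f"status {status}: {name} ({message})")
```

This only translates the numbers; it does not say why the semaphore call failed.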
A search on Baidu turned up posts that all blamed a "max user processes" limit that was set too low. Going by the timestamps in the log, the nproc settings had indeed been changed around that time.
Before the change:

grid soft nproc 4096
grid hard nproc 3088654
grid soft nofile 1024
grid hard nofile 65536

oracle soft nproc 4096
oracle hard nproc 3088654
oracle soft nofile 1024
oracle hard nofile 65536
After the change:

grid soft nproc 9000
grid hard nproc 3088654
grid soft nofile 10240
grid hard nofile 655360

oracle soft nproc 9000
oracle hard nproc 3088654
oracle soft nofile 10240
oracle hard nofile 655360

Running ulimit -a already shows the new values.
The limits were raised because the Oracle processes parameter is set to 8192:

SQL> show parameter processes;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
aq_tm_processes                      integer     1
db_writer_processes                  integer     12
gcs_server_processes                 integer     5
global_txn_processes                 integer     1
job_queue_processes                  integer     1000
log_archive_max_processes            integer     4
processes                            integer     8192
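The comparison behind that reasoning can be sketched as follows: parse the limits.conf-style entries and check whether the soft nproc value for a user is at least the Oracle processes setting. The helper names `parse_limits` and `soft_nproc_covers` are made up for this illustration, and the check is only a rough heuristic, since nproc counts every process and thread owned by the user, not just those of one instance.

```python
# Entries copied from the before/after listings in this post.
BEFORE = """
oracle soft nproc 4096
oracle hard nproc 3088654
"""
AFTER = """
oracle soft nproc 9000
oracle hard nproc 3088654
"""
ORACLE_PROCESSES = 8192  # value of the "processes" parameter shown above


def parse_limits(text):
    """Parse limits.conf-style lines into {(domain, type, item): value}."""
    limits = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split columns
        if len(fields) == 4:
            domain, kind, item, value = fields
            limits[(domain, kind, item)] = value
    return limits


def soft_nproc_covers(limits, user, required):
    """Return True if the user's soft nproc limit is >= required."""
    value = limits.get((user, "soft", "nproc"))
    if value is None:
        return False  # no explicit entry; treated as "not covered" here
    if value in ("unlimited", "infinity", "-1"):
        return True
    return int(value) >= required


for label, text in (("before", BEFORE), ("after", AFTER)):
    ok = soft_nproc_covers(parse_limits(text), "oracle", ORACLE_PROCESSES)
    print(f"{label}: soft nproc covers processes={ORACLE_PROCESSES}? {ok}")
```

With the values from this post, the old soft limit of 4096 is below 8192 and the new one of 9000 is above it.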
My guess is that the new limits had still not actually taken effect. Rebooting the server resolved the problem.
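One way to test that guess without a reboot would be to look at the limits of a process that is already running, because a process keeps the limits it started with, while ulimit -a in a fresh login shell only reflects the new session. On Linux these are exposed in /proc/<pid>/limits. The sketch below parses that format; `parse_proc_limits` and `read_proc_limits` are hypothetical helper names, and SAMPLE is an illustrative excerpt built from the "before" values, not output captured from this system.

```python
import re

# Illustrative excerpt in the /proc/<pid>/limits layout (not real output).
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max processes             4096                 3088654              processes
Max open files            1024                 65536                files
"""


def parse_proc_limits(text):
    """Map each limit name to its (soft, hard) values as strings."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def read_proc_limits(pid):
    """Read and parse the limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_proc_limits(handle.read())


soft, hard = parse_proc_limits(SAMPLE)["Max processes"]
print(f"Max processes: soft={soft} hard={hard}")
```

Calling `read_proc_limits` with the OS pid of a background process such as PMON would show whether the instance is really running with the new nproc value.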





09-02 02:14