AWS数据库云服务器出现了连接超时的错误,于是查看相关时段的alert日志,发现了如下的错误:

***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.4.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
Time: 25-OCT-2017 05:00:12
Tracing not turned on.
Tns error struct:
ns main err code: 12535

TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=x.x.x.x)(PORT=4459))
Wed Oct 25 05:18:13 2017
***********************************************************************

网络连接超时,鉴于以前也经常遇到类似的TNS错误,因此需要彻底搞定它。查看MOS后找到如下文章:

Alert Log Errors: 12170 TNS-12535/TNS-00505: Operation Timed Out (文档 ID 1628949.1)

主要内容如下:

CAUSE
The 'nt secondary err code' identifies the underlying network transport, such as (TCP/IP) timeout limit. In the current case 60 identifies Windows underlying transport layer.
The "nt secondary err code" will be different based on the operating system:
Linux x86 or Linux x86-64: "nt secondary err code: 110"
HP-UX : "nt secondary err code: 238"
AIX: "nt secondary err code: 78"
Solaris: "nt secondary err code: 145"

The alert.log message indicates that a connection was terminated AFTER it was established to the instance. In this case, it was terminated 2 hours and 3 minutes after the listener

handed the connection to the database.

This would indicate an issue with a firewall where a maximum idle time setting is in place.
The connection would not necessarily be "idle". This issue can arise during a long running query or when using JDBC Thin connection pooling.
If there is no data 'on the wirte for lengthy periods of time for any reason, the firewall might terminate the connection.

SOLUTION
The non-Oracle solution would be to remove or increase the firewall setting for maximum idle time. In cases where this is not feasible,
Oracle offers the following suggestion:

The following parameter, set at the **RDBMS_HOME/network/admin/sqlnet.ora, can resolve this kind of problem.
DCD or SQLNET.EXPIRE_TIME can mimic data transmission between the server and the client during long periods of idle time.

SQLNET.EXPIRE_TIME=n Where <n> is a non-zero value set in minutes.

Once this change is in place, there is NO need to restart the listener or the database.
The change will be in place for all newly spawned server processes following the change.
Be aware that connections that were established prior to this setting would not be affected by this change. Therefore, you may continue to experience some timeouts until all remote
connection are established with this setting in place.

See the following : Note 257650.1 Resolving Problems with Connection Idle Timeout With Firewall

从文中我们看出,这可能是防火墙策略或者JDBC连接池的原因,服务器自动将长时间没有响应的会话关闭,可以通过设置SQLNET.EXPIRE_TIME来解决。

文中也列出了不同操作系统报出的不同nt secondary err code。

问题的基本诊断思路如下:

1.查找问题时间段的alert日志和监听日志,其位置分别是$ORACLE__BASE/diag/rdbms/db_name/$ORACLE_SID/trace/和$ORACLE__BASE/diag/tnslsnr/hostname/listener/trace/

2.如果alert日志显示如上报错,但listener日志显示正常,那么说明确实是数据库以外的问题,按如上文档的solution解决即可。

3.如果监听日志显示了其他报错,那么再根据报错找到相应的解决办法。

05-11 15:53