问题描述
我们有一个在Tomcat 6上运行的应用程序(确切地说是6.0.35.0),由于Catalina.await方法中的socketAccept调用抛出SocketException,我们在Mac OS上的大多数工程师都遇到启动Tomcat的问题: / p>
We have an application that runs on Tomcat 6 (6.0.35.0 to be precise), and most of our engineers on Mac OS are having problems starting Tomcat due to the socketAccept call in the Catalina.await method throwing a SocketException:
SEVERE: StandardServer.await: accept:
java.net.SocketException: Invalid argument
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.socketAccept(PlainSocketImpl.java)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:522)
at java.net.ServerSocket.accept(ServerSocket.java:490)
at org.apache.catalina.core.StandardServer.await(StandardServer.java:431)
at org.apache.catalina.startup.Catalina.await(Catalina.java:676)
at org.apache.catalina.startup.Catalina.start(Catalina.java:628)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
at mycompany.tomcat.startup.ThreadDumpWrapper.main(ThreadDumpWrapper.java:260)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.tanukisoftware.wrapper.WrapperStartStopApp.run(WrapperStartStopApp.java:238)
at java.lang.Thread.run(Thread.java:722)
这会导致Tomcat在启动后立即关闭(并且没有少量的愤怒)。
我们认为这在Mac OS w / Java 1.7上持续了一段时间,在过去的几个月里,我们很多人已经转向Macbook Pros。到目前为止,唯一的症状是来自Tomcat的偶尔零字节响应,因为这个异常也被抛出在socketRead上。错误没有打到日志,我们单独耸了耸肩作为一个孤立的问题,只有在启动问题开始时才找到原因并且我设置了一个SocketException断点:
This causes Tomcat to shut down immediately after startup (and no small amount of rage).We think this has been with us for the duration on Mac OS w/ Java 1.7, in the last several months a lot of us have switched to Macbook Pros. Up until now, the only symptom was occasional zero byte responses from Tomcat, due to this exception also being thrown on a socketRead. Errors don't hit the logs and we'd individually shrugged it off as an isolated problem, and only found the cause when the startup problem started and I set a SocketException breakpoint:
Daemon Thread [http-8080-1] (Suspended (breakpoint at line 47 in SocketException))
SocketException.<init>(String) line: 47
SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method]
SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available
SocketInputStream.read(byte[], int, int, int) line: 150
SocketInputStream.read(byte[], int, int) line: 121
InternalInputBuffer.fill() line: 735
InternalInputBuffer.parseRequestLine() line: 366
Http11Processor.process(Socket) line: 814
Http11Protocol$Http11ConnectionHandler.process(Socket) line: 602
JIoEndpoint$Worker.run() line: 489
Thread.run() line: 722
对于参数:
arg0 FileDescriptor (id=499)
fd 1097
useCount AtomicInteger (id=503)
value 2
arg1 (id=502)
arg2 0
arg3 8192
arg4 20000
问题是时间敏感的。由于应用程序的更改而增加启动时间(更多的Spring内省/单例开销)似乎是导致这会影响Tomcat启动的因素;引爆点约为160秒。我们可以通过禁用开发期间不需要的一些非强制性上下文来减少启动时间来缓解问题,但我更愿意找到根本原因。
The problem is time sensitive. Increasing startup time due to application changes (lots more Spring introspection/singleton overhead) seems to be the factor that causes this to affect Tomcat startup; the tipping point being about 160 seconds. We can mitigate the problem by disabling some of the non-mandatory contexts we don't need during development to reduce startup time, but I'd prefer to find the root cause.
应用程序的细节太复杂而无法详细介绍,但我有一种预感,这可能与早期的绑定有关,所以我至少会在我的机器上列出监听端口:
The specifics of the application are far too complex to go into too much detail, but I have a hunch that this might relate to an earlier bind, so I'll at least list the listening ports on my machine:
localhost:32000 - Java service wrapper port
*:10001 - RMI registry
*:2322 - Java debug
*:56566 - RMI
*:8180 - Tomcat HTTP connector
*:8543 - Tomcat HTTPS connector
*:2223 - Tomcat Internal HTTP connector (used for cross-server requests)
*:14131 - 'Locking' port to determine if an internal service is running
*:56571 - EhCache RMI
*:56573 - RMI
*:62616 - ActiveMQ broker
*:5001 - SOAPMonitorService
*:8109 - Tomcat shutdown port
排除的项目
- 最明显的解决方案:
-Djava.net.preferIPv4Stack = true
。我总是配置了该选项 - 我们的基本应用程序配置,库,JVM选项(没有任何)的任何最新配置更改
- JDK回归。我已经测试了JDK 1.7.0_09,11,15,17和21(我已经在我的机器上安装了JDK)
- Mac OS更新。 Mac OS 10.7.x和10.8.0到1.8.3受到影响
- 文件描述符限制 - 从
5000
增加到10000
- 在主以太网接口上完全禁用IPv6
- 设置断点,并删除第一个上下文受SocketException的影响(它们是对Web服务的传出HTTP调用)。无变化
- 配置
/ etc / hosts
,以便机器主机名解析为localhost,并配置JVM选项以优先选择IPv4和不更喜欢IPv6地址(答案: ) - The most obvious solution:
-Djava.net.preferIPv4Stack=true
. I've always had that option configured - Any recent configuration change to our base application configuration, libraries, JVM options (there aren't any)
- A JDK regression. I've tested JDK 1.7.0_09, 11, 15, 17 and 21 (the JDKs I've had installed on my machine for the duration)
- Mac OS update. Mac OS 10.7.x and 10.8.0 through 1.8.3 are affected
- File descriptor limits - increased from
5000
to10000
- Disabling IPv6 completely on the main ethernet interface
- Setting breakpoints, and removing the first contexts to be affected by the SocketException (they're outgoing HTTP calls to web services). No change
- Configuring
/etc/hosts
so the machine hostname resolves to localhost, and configuring JVM options to prefer IPv4 and to not prefer IPv6 addresses (This answer: https://stackoverflow.com/a/16318860/364206)
Items ruled out
对于那些对主机配置感兴趣的人,它与默认配置相同。我可以在Fusion VM上重现这一点,干净安装10.8:
For those interested in hosts configuration, it's the same as default. I can reproduce this on a Fusion VM w/ a clean install of 10.8:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
fe80::1%lo0 localhost
Java代码调查
由于问题的时间敏感性,设置断点来解决问题导致它不会发生。根据评论中的要求,我还为 SocksSocketImpl(PlainSocketImpl).socketAccept(SocketImpl)
捕获了 arg0
,似乎没有任何结果平凡的。
Java code investigation
Due to the apparent time sensitive nature of the issue, setting breakpoints to troubleshoot the issue causes it to not occur. As requested in the comments, I also captured arg0
for SocksSocketImpl(PlainSocketImpl).socketAccept(SocketImpl)
, nothing seems out of the ordinary.
arg0 SocksSocketImpl (id=460)
address InetAddress (id=465)
canonicalHostName null
holder InetAddress$InetAddressHolder (id=475)
address 0
family 0
hostName null
applicationSetProxy false
closePending false
cmdIn null
cmdOut null
cmdsock null
CONNECTION_NOT_RESET 0
CONNECTION_RESET 2
CONNECTION_RESET_PENDING 1
external_address null
fd FileDescriptor (id=713)
fd -1
useCount AtomicInteger (id=771)
value 0
fdLock Object (id=714)
fdUseCount 0
localport 0
port 0
resetLock Object (id=716)
resetState 0
server null
serverPort 1080
serverSocket null
shut_rd false
shut_wr false
socket Socket (id=718)
bound false
closed false
closeLock Object (id=848)
connected false
created false
impl null
oldImpl false
shutIn false
shutOut false
socketInputStream null
stream false
timeout 0
trafficClass 0
useV4 false
I think all of the threads where the exceptions are thrown are victims of an earlier call, one that doesn't result in a SocketException so I haven't been able to catch it. Being able to start Tomcat by reducing startup times convinces me that the trigger is probably some scheduled task that performs a socket based operation, which then affects other socket operations.
这不是解释为什么这可能影响多个线程,无论我们做什么导致这种情况,神秘的SocketExceptions不应该从本机代码冒出来并在多个线程上同时导致这些异常 - 也就是说,两个线程使得传出Web服务调用,Tomcat等待调用,以及多个TP处理器线程重复。
That doesn't explain how and why this could affect several threads, no matter what we're doing to cause this condition, mysterious SocketExceptions shouldn't bubble up from native code and cause these exceptions simultaneously on multiple threads - that is, two threads making outgoing web service calls, the Tomcat await call, and several TP processor threads repeatedly.
给定通用消息,我假设必须从socketAccept JNI代码中的一个系统调用返回 EINVAL
错误,所以我跟踪导致异常的系统调用;任何系统调用都没有返回 EINVAL
。所以,我去OpenJDK源寻找socketAccept代码中的条件,该代码将设置然后抛出 EINVAL
,但我也找不到任何设置<$的代码c $ c> errno 到 EINVAL
,或者调用 NET_ThrowByNameWithLastError
, NET_ThrowCurrent
或 NET_ThrowNew
以一种方式抛出SocketException并显示此默认错误消息。
Given the generic message, I assumed that an EINVAL
error must be returned from one of the system calls in the socketAccept JNI code, so I traced the system calls leading up to the exception; there's no EINVAL
returned from any system call. So, I went to the OpenJDK sources looking for conditions in the socketAccept code that would set and then throw an EINVAL
, but I also couldn't find any code that sets errno
to EINVAL
, or calls NET_ThrowByNameWithLastError
, NET_ThrowCurrent
or NET_ThrowNew
in a way that would throw a SocketException with this default error message.
就系统调用而言,我们似乎没有接受系统调用:
As far as the system calls, we don't seem to get as far as the accept system call:
PID/THRD RELATIVE ELAPSD CPU SYSCALL(args) = return
6606/0x2c750d: 221538243 5 0 sigprocmask(0x1, 0x0, 0x14D8BE100) = 0x0 0
6606/0x2c750d: 221538244 3 0 sigaltstack(0x0, 0x14D8BE0F0, 0x0) = 0 0
6606/0x2c750d: 221538836 14 10 socket(0x2, 0x1, 0x0) = 1170 0
6606/0x2c750d: 221538837 3 0 fcntl(0x492, 0x3, 0x4) = 2 0
6606/0x2c750d: 221538839 3 1 fcntl(0x492, 0x4, 0x6) = 0 0
6606/0x2c750d: 221538842 5 2 setsockopt(0x492, 0xFFFF, 0x4) = 0 0
6606/0x2c750d: 221538852 7 4 bind(0x492, 0x14D8BE5D8, 0x10) = 0 0
6606/0x2c750d: 221538857 5 2 listen(0x492, 0x1, 0x4) = 0 0
6606/0x2c750d: 221539625 6 2 psynch_cvsignal(0x7FEFBFE00868, 0x10000000200, 0x100) = 257 0
6606/0x2c750d: 221539633 4 1 write(0x2, "Apr 18, 2013 11:05:35 AM org.apache.catalina.core.StandardServer await\nSEVERE: StandardServer.await: accept: \njava.net.SocketException: Invalid argument\n\tat java.net.PlainSocketImpl.socketAccept(Native Method)\n\tat java.net.PlainSocketImpl.socketAcce", 0x644) = 1604 0
所以,我认为问题发生在 socketAccept ,但我找不到 NET_Timeout
将 errno
设置为的任何情况 EINVAL
,并导致抛出此SocketException。我指的是这段代码;我假设jdk7u分支大部分是Oracle JDK中的内容:
So, I think the problem occurs in the timeout handling code at the top of the accept loop in socketAccept
, but I couldn't find any case where NET_Timeout
would set errno
to EINVAL
, and result in this SocketException being thrown. I'm referring to this code; I assume the jdk7u branch is for the most part what ships in the Oracle JDK:
- http://hg.openjdk.java.net/jdk7u/jdk7u/jdk/file/d4bf5c15837c/src/solaris/native/java/net/PlainSocketImpl.c
- http://hg.openjdk.java.net/jdk7u/jdk7u/jdk/file/d4bf5c15837c/src/solaris/native/java/net/bsd_close.c
- http://hg.openjdk.java.net/jdk7u/jdk7u/jdk/file/d4bf5c15837c/src/solaris/native/java/net/net_util_md.c
我无法找到受此特征影响的外界人士在Mac OS上出现问题,但这里几乎每个人都受到影响。必须有一些应用程序配置有所贡献,但我已经用尽了我能想到的每个途径来找到根本原因。
I can't find anyone in the outside world affected by this particular problem on Mac OS, but almost everyone here is affected. There must be some application configuration that contributes, but I've exhausted every avenue I can think of to find the root cause.
有关可能原因的故障排除或见解的指示非常感谢。
Pointers on troubleshooting or insight on a possible cause would be much appreciated.
推荐答案
你试过 -Xcheck:JNI ?有趣的是使用 PlainSocketImpl.socketAccept
错误作为何时使用它的示例。
Have you tried turning on JNI debugging with -Xcheck:jni
? Interestingly the Oracle documentation uses a PlainSocketImpl.socketAccept
error as an example of when to use this.
另请注意是JNI在大多数平台上使用 poll()
但是 select()
由于Mac上的 poll()
有问题,在Mac OS上。所以也许 select()
也被打破了。进一步挖掘,如果ndfs大于FD_SETSIZE并且未定义_DARWIN_UNLIMITED_SELECT,则select()将返回EINVAL。 FD_SETSIZE是1024,听起来你有大量的应用程序加载,所以也许所有过滤器一直等待更多的1024个FD。
Note also that the implication of Bug 7131399 is that the JNI uses poll()
on most platforms but select()
on Mac OS due to a problem with poll()
on the Mac. So maybe select()
is broken too. Digging in a bit further, select() will return EINVAL if "ndfs is greater than FD_SETSIZE and _DARWIN_UNLIMITED_SELECT is not defined." FD_SETSIZE is 1024 and it sounds like you have a ton of applications loading up, so perhaps it all filters down to waiting on more that 1024 FDs at one time.
额外的信用,请参阅是否在事实已经固定在你的机器上错误报告指向测试用例。
For extra credit, see if the related (supposedly fixed) Java bug is in fact fixed on your machine. The bug report has pointers to test cases.
感谢Old Pro的回答,我确认 select()
FD_SETSIZE限制是原因。我找到了此限制的现有错误:
Thanks to Old Pro's answer, I confirmed that the select()
FD_SETSIZE limitation is the cause. I located an existing bug for this limitation:
可以使用以下代码重现该问题:
The problem can be reproduced with the following code:
import java.io.*;
import java.net.*;
public class SelectTest {
public static void main(String[] args) throws Exception {
// Use 1024 file descriptors. There'll already be some in use, obviously, but this guarantees the problem will occur
for(int i = 0; i < 1024; i++) {
new FileInputStream("/dev/null");
}
ServerSocket socket = new ServerSocket(8080);
socket.accept();
}
}
差不多一年后,Java 7u60解决了这个问题问题:
Almost a year later, Java 7u60 has a fix this problem:
I还发现Tomcat的WebappClassLoader在90秒后关闭文件句柄,这解释了为什么设置断点可以防止问题发生。
I also discovered the Tomcat's WebappClassLoader closes file handles after 90 seconds, which explains why setting break points prevented the issue from occurring.
这篇关于由于Mac OS X上的“java.net.SocketException Invalid argument”,Tomcat启动失败的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!