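I have a Python script that simply checks SQS for messages in a loop and then exits. A cron job restarts the script every few minutes if it is not already running.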
# start
def main():
    for i in range(1, 101):
        # Check SQS for a new message (this establishes connections to SQS).
        # Long polling is not used; Receive Message Wait Time is set to 0.
        job = check_sqs_for_new_message()  # placeholder name for the SQS check
        if job:
            ProcessIt()
# end
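After the script has been running on an EC2 instance for a few days, I find that it hangs and no longer checks SQS for new messages.
When I run lsof on the process's PID and grep for the SQS connections, all of them are in CLOSE_WAIT. My workaround is to kill the process manually and restart the script. So it seems cron cannot restart the script either, because the process is still running, stuck in a call to SQS: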
ip-10-x-y-z:~ # lsof -p 9018 | grep "72.21"
ld-linux. 9018 root 7u IPv4 474699439 0t0 TCP ip-10-x-y-z.ec2.internal:58211->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 10u IPv4 474699560 0t0 TCP ip-10-x-y-z.ec2.internal:53428->72.21.194.47:https (CLOSE_WAIT)
ld-linux. 9018 root 12u IPv4 474701017 0t0 TCP ip-10-x-y-z.ec2.internal:52166->72.21.214.70:https (CLOSE_WAIT)
ld-linux. 9018 root 18u IPv4 474694555 0t0 TCP ip-10-x-y-z.ec2.internal:57267->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 22u IPv4 474694573 0t0 TCP ip-10-x-y-z.ec2.internal:57271->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 39u IPv4 474701031 0t0 TCP ip-10-x-y-z.ec2.internal:52170->72.21.214.70:https (CLOSE_WAIT)
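For readers unfamiliar with the state shown above: CLOSE_WAIT means the remote end has already closed its side of the TCP connection, while the local process has not yet closed its own socket. A minimal sketch that reproduces this on the loopback interface follows. The function name `peer_closed_demo` is made up for illustration, and the listener only stands in for the remote SQS endpoint.

```python
import socket


def peer_closed_demo():
    """Show what the local side sees after the remote peer closes."""
    # Listener on an ephemeral loopback port plays the remote server.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    # "local" plays the role of the script's connection.
    local = socket.create_connection(listener.getsockname())
    remote, _ = listener.accept()

    # The remote peer closes first, which sends a FIN to the local side.
    remote.close()

    # recv() returns an empty byte string at end of stream. From here
    # until local.close() is called, the local socket is in CLOSE_WAIT.
    local.settimeout(5)
    data = local.recv(1024)

    local.close()
    listener.close()
    return data


if __name__ == "__main__":
    print(repr(peer_closed_demo()))
```

A process that never reaches its own `close()` call, for example because it is blocked elsewhere, leaves such sockets in CLOSE_WAIT indefinitely, which matches the lsof output.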
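I know I should be using long polling, but I would still like to know why the process gets stuck and cannot recover on its own. I am using Boto 2.23.
Any input would be helpful.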
Best answer
Debugging the stuck process with gdb produced the following backtrace:
(gdb) pystack
~/mypackage/lib/python2.6/ssl.py (293): do_handshake
~/mypackage/lib/python2.6/ssl.py (120): __init__
~/mypackage/lib/python2.6/ssl.py (350): wrap_socket
~/mypackage/lib/python2.6/site-packages/boto/https_connection.py (118): connect
~/mypackage/lib/python2.6/httplib.py (725): send
~/mypackage/lib/python2.6/httplib.py (764): _send_output
~/mypackage/lib/python2.6/httplib.py (892): endheaders
~/mypackage/lib/python2.6/httplib.py (937): _send_request
~/mypackage/lib/python2.6/httplib.py (899): request
~/mypackage/lib/python2.6/site-packages/boto/connection.py (902): _mexe
~/mypackage/lib/python2.6/site-packages/boto/connection.py (1063): make_request
~/mypackage/lib/python2.6/site-packages/boto/connection.py (1138): get_object
~/mypackage/lib/python2.6/site-packages/boto/sqs/connection.py (355): get_queue
~/mypackage/lib/python2.6/site-packages/sqs/SQSHelper.py (96): __init__
~/mypackage/sqs/SQSWrapper.py (1229): main
~/mypackage/sqs/SQSWrapper.py (1367): <module>
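As we can see, my script is stuck in the SQS get_queue() API call, inside the SSL handshake (do_handshake).
The problem appears to be in the SSL handshake code of Python 2.6. It was fixed in Python 2.7, although someone has reported the same problem on Python 2.7 as well [see the links below]. To work around the whole problem, I will move to Python 2.7 and also put a timeout of a few minutes around the SQS API calls in my SQS wrapper code.
The following links helped me narrow down the root cause and the workaround: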
http://bugs.python.org/issue5103
http://hg.python.org/cpython/rev/ce4916ca06dd/
Web app hangs for several hours in ssl.py at self._sslobj.do_handshake()
Timeout function if it takes too long to finish
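The timeout mentioned above could be sketched roughly as follows, using the standard-library `signal.alarm` approach. This is an illustration under assumptions, not the wrapper code from the answer: `time_limit`, `CallTimedOut` and `slow_call` are made-up names, and `slow_call` only stands in for a hung `get_queue()` call. It works on Unix and in the main thread only, and whether the alarm actually interrupts a handshake blocked inside C code depends on the Python and OpenSSL build, so it should be tested against the real failure.

```python
import signal
import time
from contextlib import contextmanager


class CallTimedOut(Exception):
    """Raised when the wrapped block exceeds its time budget."""


@contextmanager
def time_limit(seconds):
    """Raise CallTimedOut if the with-block runs longer than `seconds`."""
    def _on_alarm(signum, frame):
        raise CallTimedOut("call exceeded %d seconds" % seconds)

    # Install the handler and arm the alarm (Unix, main thread only).
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        # Always disarm the alarm and restore the previous handler.
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)


def slow_call():
    # Hypothetical stand-in for an SQS call that never returns.
    time.sleep(5)


if __name__ == "__main__":
    try:
        with time_limit(1):
            slow_call()
    except CallTimedOut:
        # Exiting here lets the next cron run start a fresh process.
        print("SQS call timed out; exiting")
```

In the real script the `with time_limit(...)` block would wrap the SQS calls, with the limit set to a few minutes as described in the answer. A socket-level timeout such as `socket.setdefaulttimeout()` is another standard-library option, but its effect on the handshake hang was not verified here.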