Problem description
I am running Celery 2.2.4/djCelery 2.2.4, using RabbitMQ 2.1.1 as a backend. I recently brought online two new celery servers -- I had been running 2 workers across two machines with a total of ~18 threads and on my new souped up boxes (36g RAM + dual hyper-threaded quad-core), I am running 10 workers with 8 threads each, for a total of 180 threads -- my tasks are all pretty small so this should be fine.
The nodes have been running fine for the last few days, but today I noticed that .delay() is hanging. When I interrupt it, I see a traceback that points here:
File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 324, in delay
return self.apply_async(args, kwargs)
File "/home/django/deployed/releases/20110608183345/virtual-env/lib/python2.5/site-packages/celery/task/base.py", line 449, in apply_async
publish.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/kombu/compat.py", line 108, in close
self.backend.close()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/channel.py", line 194, in close
(20, 41), # Channel.close_ok
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/abstract_channel.py", line 89, in wait
self.channel_id, allowed_methods)
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/connection.py", line 198, in _wait_method
self.method_reader.read_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 212, in read_method
self._next_method()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/method_framing.py", line 127, in _next_method
frame_type, channel, payload = self.source.read_frame()
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 109, in read_frame
frame_type, channel, size = unpack('>BHI', self._read(7))
File "/home/django/deployed/virtual-env/lib/python2.5/site-packages/amqplib/client_0_8/transport.py", line 200, in _read
s = self.sock.recv(65536)
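The bottom frame explains the symptom: amqplib's transport blocks in sock.recv() on a socket with no timeout, so a broker that stops responding leaves the client hanging forever rather than raising. A minimal stdlib sketch (not amqplib code) of the difference a socket timeout makes:

```python
import socket

# Without a timeout, recv() blocks until the peer sends data; with one,
# it raises socket.timeout after the deadline. The socket inside
# amqplib's transport has no such timeout, hence the indefinite hang.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # ephemeral port, stands in for the broker
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
client.settimeout(0.1)               # the crucial difference

try:
    client.recv(7)                   # nothing will ever arrive
    timed_out = False
except socket.timeout:
    timed_out = True

client.close()
server.close()
print(timed_out)  # True
```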
I've checked the Rabbit logs, and I see the process trying to connect as:
=INFO REPORT==== 12-Jun-2011::22:58:12 ===
accepted TCP connection on 0.0.0.0:5672 from x.x.x.x:48569
I have my Celery log level set to INFO, but I don't see anything particularly interesting in the Celery logs EXCEPT that 2 of the workers can't connect to the broker:
[2011-06-12 22:41:08,033: ERROR/MainProcess] Consumer: Connection to broker lost. Trying to re-establish connection...
All of the other nodes are able to connect without issue.
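Before digging into amqplib itself, it can be worth verifying that each worker host can at least reach the broker port, since a bare TCP connect is the same first step amqplib's transport performs before any AMQP handshake. A small stdlib helper (host and port are placeholders; against the real broker you would pass your Rabbit host and 5672):

```python
import socket

def broker_reachable(host, port, timeout=3.0):
    """Attempt a plain TCP connect to the broker and report success."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False

# Stand-in listener so the example is self-contained.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
print(broker_reachable(*server.getsockname()))  # True
server.close()
```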
I know that there was a posting ( RabbitMQ / Celery with Django hangs on delay/ready/etc - No useful log info ) last year of a similar nature, but I'm pretty certain that this is different. Could it be that the sheer number of workers is creating some sort of race condition in amqplib? I found this thread, which seems to indicate that amqplib is not thread-safe; I'm not sure if this matters for Celery.
I've tried celeryctl purge on both nodes -- on one it succeeds, but on the other it fails with the following AMQP error:
AMQPConnectionException(reply_code, reply_text, (class_id, method_id))
amqplib.client_0_8.exceptions.AMQPConnectionException:
(530, u"NOT_ALLOWED - cannot redeclare exchange 'XXXXX' in vhost 'XXXXX'
with different type, durable or autodelete value", (40, 10), 'Channel.exchange_declare')
On both nodes, inspect stats hangs with the "can't close connection" traceback above. I'm at a loss here.
EDIT 2: I was able to delete the offending exchange using exchange.delete from camqadm, and now the second node hangs too :(.
One thing that also recently changed is that I added an additional vhost to rabbitmq, which my staging node connects to.
Recommended answer
Hopefully this will save somebody a lot of time...though it certainly does not save me any embarrassment:
/var was full on the server that was running rabbit. With all of the nodes that I added, rabbit was doing a lot more logging and it filled up /var -- I couldn't write to /var/lib/rabbitmq, and so no messages were going through.
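A quick free-space check on the broker host would have caught this; something along these lines is worth scripting into monitoring. This sketch uses shutil.disk_usage, which is Python 3.3+ (the box in the question ran Python 2.5, where os.statvfs would be the equivalent), and the path is illustrative:

```python
import shutil

def free_megabytes(path):
    """Free space on the filesystem containing `path`, in MiB."""
    usage = shutil.disk_usage(path)   # named tuple: (total, used, free) bytes
    return usage.free // (1024 * 1024)

# On the rabbit host you would point this at /var (or wherever
# /var/lib/rabbitmq lives) and alert well before it reaches zero.
print(free_megabytes("/"))
```

Newer RabbitMQ releases also ship a disk_free_limit alarm that blocks publishers before the disk fills completely, which fails much more loudly than the silent hang described here.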