When I start the docker environment and put some load on the engine by scheduling tasks, I get several celery/amqp-related connection errors:
- swh-scheduler runner often (after a while, when inserting new tasks, if one or more workers are loading the system) falls into a failed state where it loops very quickly, attempting to (re)establish an amqp connection. When this occurs, the first exception is:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 495, in _ensured
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/kombu/common.py", line 135, in _maybe_declare
    entity.declare(channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 605, in declare
    self._create_queue(nowait=nowait, channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 614, in _create_queue
    self.queue_declare(nowait=nowait, passive=False, channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 649, in queue_declare
    nowait=nowait,
  File "/usr/local/lib/python3.6/site-packages/amqp/channel.py", line 1150, in queue_declare
    nowait, arguments),
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 51, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/usr/local/lib/python3.6/site-packages/amqp/method_framing.py", line 172, in write_frame
    write(view[:offset])
  File "/usr/local/lib/python3.6/site-packages/amqp/transport.py", line 282, in write
    self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
It then loops very quickly with the same traceback, except that the last line becomes:
BrokenPipeError: [Errno 32] Broken pipe
When this error occurs, the runner is in a failed state (despite being still "running") since it cannot send tasks any more, and must be restarted.
Note that these tracebacks are normally caught by kombu; I had to add logging statements in kombu.connection.Connection.ensure (in the _ensured wrapper) to be able to see them.
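For anyone wanting to reproduce this kind of debugging, here is a minimal sketch of the idea: wrap a call so that any exception is logged with its full traceback before being re-raised. This is not kombu's code and not the exact patch I applied; the logger name, the log_exceptions decorator and the declare_queue stand-in are all hypothetical names used for illustration only.

```python
import functools
import logging

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("amqp-debug")


def log_exceptions(fun):
    """Log the full traceback of any exception raised by fun, then re-raise it."""
    @functools.wraps(fun)
    def wrapper(*args, **kwargs):
        try:
            return fun(*args, **kwargs)
        except Exception:
            # logger.exception records the message at ERROR level and
            # attaches the current traceback to the log record.
            logger.exception("call to %s failed", fun.__name__)
            raise
    return wrapper


@log_exceptions
def declare_queue():
    # Stand-in for a real broker call; it simulates the error seen above.
    raise ConnectionResetError(104, "Connection reset by peer")
```

Because the exception is re-raised unchanged, whatever retry logic sits above the wrapped call behaves as before; the only difference is that the swallowed error becomes visible in the logs.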
This problem resembles several issues reported upstream (listed further down).
- I often have messages in the amqp logs like:
amqp_1 | =WARNING REPORT==== 22-Jan-2019::15:04:09 ===
amqp_1 | closing AMQP connection <0.10097.4795> (172.20.0.27:57486 -> 172.20.0.8:5672, vhost: '/', user: 'guest'):
amqp_1 | client unexpectedly closed TCP connection
These seem to result from a heavily loaded worker (the pypi loader in this case, with 4 workers running). Each time this occurs, a new amqp connection is logged in the worker's log, but there is no traceback or error message.
- flower constantly logs "ConnectionResetError: [Errno 104] Connection reset by peer" errors, making it pretty unreliable.
These unexpectedly closed TCP connections might be related to one or more of the following upstream issues:
- https://github.com/celery/celery/issues/4355
- https://github.com/celery/celery/issues/4108
- https://github.com/celery/celery/issues/4226
- https://github.com/celery/celery/issues/3377
For the first problem, I have tried with the git (master) version of:
- amqp on swh-scheduler-runner, did not fix the problem
- kombu on swh-scheduler-runner: for now it seems to be more stable (no connection problems have occurred since I installed kombu from the git repo).
Overall, this gives the impression that celery 4 is pretty unreliable on a moderately loaded system (my machine's load is around 2 to 3 with the pypi loader working with 4 workers; the other workers are pretty idle)...