Software Heritage

Connection problems with latest celery (4.2.1; docker env, production)
Closed, Resolved · Public

Description

When I start the docker environment and try to load the engine a bit with tasks, I get several celery/amqp-related connection errors:

  1. The swh-scheduler runner often (after a while, when inserting new tasks, while one or more workers are loading the system) falls into a failed state where it loops very quickly attempting to (re)make an amqp connection. When this occurs, the first exception is:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 495, in _ensured
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/kombu/common.py", line 135, in _maybe_declare
    entity.declare(channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 605, in declare
    self._create_queue(nowait=nowait, channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 614, in _create_queue
    self.queue_declare(nowait=nowait, passive=False, channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 649, in queue_declare
    nowait=nowait,
  File "/usr/local/lib/python3.6/site-packages/amqp/channel.py", line 1150, in queue_declare
    nowait, arguments),
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 51, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/usr/local/lib/python3.6/site-packages/amqp/method_framing.py", line 172, in write_frame
    write(view[:offset])
  File "/usr/local/lib/python3.6/site-packages/amqp/transport.py", line 282, in write
    self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer

then it loops very quickly with the same traceback, except that the last line becomes a

BrokenPipeError: [Errno 32] Broken pipe

When this error occurs, the runner is in a failed state (despite still being "running"): it cannot send tasks any more and must be restarted.

Note that these tracebacks are normally caught by kombu; I had to add logging statements in kombu.connection.Connection.ensure (in the _ensured wrapper) to be able to see them.
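Since kombu's retry machinery swallows these exceptions, extra logging is the only way to see them. The pattern can be illustrated with a self-contained sketch (this is not kombu's actual code, just the general "log recoverable errors before retrying" idea that the added logging statements implement):

```python
import logging
import traceback

logger = logging.getLogger("amqp.debug")


def ensure_with_logging(fun, max_retries=3,
                        recoverable=(ConnectionResetError, BrokenPipeError)):
    """Call fun(); on a recoverable connection error, log the full traceback
    before retrying, instead of silently swallowing it the way kombu's
    ensure()/_ensured wrapper does by default."""
    def wrapper(*args, **kwargs):
        last_exc = None
        for attempt in range(1, max_retries + 1):
            try:
                return fun(*args, **kwargs)
            except recoverable as exc:
                last_exc = exc
                logger.warning("attempt %d/%d failed:\n%s",
                               attempt, max_retries, traceback.format_exc())
        raise last_exc
    return wrapper
```

Wrapping the connection-using callable this way surfaces the ConnectionResetError/BrokenPipeError sequence described above without changing the retry behaviour.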

This problem looks like a bunch of reported issues upstream:

  2. I often have messages in the amqp logs like:
amqp_1    | =WARNING REPORT==== 22-Jan-2019::15:04:09 ===
amqp_1    | closing AMQP connection <0.10097.4795> (172.20.0.27:57486 -> 172.20.0.8:5672, vhost: '/', user: 'guest'):
amqp_1    | client unexpectedly closed TCP connection

These seem to result from a heavily loaded worker (the pypi loader in this case, with 4 workers running). Each time this occurs, a new amqp connection appears in the worker's log, but no traceback or error message.

  3. flower constantly logs ConnectionResetError: [Errno 104] Connection reset by peer, making it pretty unreliable.

These unexpectedly closed TCP connections might be related to one or more upstream issues:

For the first problem, I have tried with the git (master) version of:

  • amqp on swh-scheduler-runner: did not fix the problem
  • kombu on swh-scheduler-runner: for now it seems more stable (no connection problems have occurred since I installed kombu from the git repo).
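For reference, installing kombu from git master over the pinned release can be done with a one-liner along these lines (the URL here is the upstream repository as an assumption; the actual deployment may have used a different remote or packaging route):

```shell
# Replace the pinned kombu release with upstream git master.
# URL assumed to be the canonical upstream repo; adjust if using a fork.
pip install --upgrade "git+https://github.com/celery/kombu.git@master#egg=kombu"
```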

Overall, this gives the impression that celery 4 is pretty unreliable on a moderately loaded system (my machine's load is around 2-3 with the pypi loader working with 4 workers; the other workers are pretty idle)...

Event Timeline

douardda created this task. Jan 22 2019, 4:18 PM
vlorentz triaged this task as High priority.

I confirm that I no longer see ConnectionResetError: [Errno 104] Connection reset by peer or BrokenPipeError: [Errno 32] Broken pipe in the runner logs with kombu from git master.

ardumont added a subscriber: ardumont. Edited Feb 16 2019, 3:00 AM

As per our pair programming yesterday, I think we are now reproducing this in production (with the runner at least).

Related T1525

ardumont renamed this task from Connection problems with celery in the docker env to Connection problems with latest celery (4.2.1; docker env, production).

A bunch of celery workers (loader*, lister*) indeed have a ConnectionResetError stacktrace (not necessarily the same one):

Feb 17 06:36:19 worker01 python3[22539]: [2019-02-17 06:36:19,352: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
                                         Traceback (most recent call last):
                                           File "/usr/lib/python3/dist-packages/celery/worker/consumer/consumer.py", line 317, in start
                                             blueprint.start(self)
                                           File "/usr/lib/python3/dist-packages/celery/bootsteps.py", line 119, in start
                                             step.start(parent)
                                           File "/usr/lib/python3/dist-packages/celery/worker/consumer/consumer.py", line 593, in start
                                             c.loop(*c.loop_args())
                                           File "/usr/lib/python3/dist-packages/celery/worker/loops.py", line 91, in asynloop
                                             next(loop)
                                           File "/usr/lib/python3/dist-packages/kombu/asynchronous/hub.py", line 354, in create_loop
                                             cb(*cbargs)
                                           File "/usr/lib/python3/dist-packages/kombu/transport/base.py", line 236, in on_readable
                                             reader(loop)
                                           File "/usr/lib/python3/dist-packages/kombu/transport/base.py", line 218, in _read
                                             drain_events(timeout=0)
                                           File "/usr/lib/python3/dist-packages/amqp/connection.py", line 491, in drain_events
                                             while not self.blocking_read(timeout):
                                           File "/usr/lib/python3/dist-packages/amqp/connection.py", line 497, in blocking_read
                                             return self.on_inbound_frame(frame)
                                           File "/usr/lib/python3/dist-packages/amqp/method_framing.py", line 55, in on_frame
                                             callback(channel, method_sig, buf, None)
                                           File "/usr/lib/python3/dist-packages/amqp/connection.py", line 501, in on_inbound_method
                                             method_sig, payload, content,
                                           File "/usr/lib/python3/dist-packages/amqp/abstract_channel.py", line 128, in dispatch_method
                                             listener(*args)
                                           File "/usr/lib/python3/dist-packages/amqp/connection.py", line 621, in _on_close
                                             self._x_close_ok()
                                           File "/usr/lib/python3/dist-packages/amqp/connection.py", line 636, in _x_close_ok
                                             self.send_method(spec.Connection.CloseOk, callback=self._on_close_ok)
                                           File "/usr/lib/python3/dist-packages/amqp/abstract_channel.py", line 50, in send_method
                                             conn.frame_writer(1, self.channel_id, sig, args, content)
                                           File "/usr/lib/python3/dist-packages/amqp/method_framing.py", line 166, in write_frame
                                             write(view[:offset])
                                           File "/usr/lib/python3/dist-packages/amqp/transport.py", line 275, in write
                                             self._write(s)
                                         ConnectionResetError: [Errno 104] Connection reset by peer

Then they cannot reopen the connection to rabbitmq (it seems a restart is in order).
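A mitigation worth noting for this wedged-worker behaviour: Celery 4 exposes broker connection settings controlling heartbeats and reconnection attempts. A hedged sketch of a celeryconfig fragment (the setting names are standard Celery 4.x options; the values are illustrative only, not what was deployed on our workers):

```python
# celeryconfig.py fragment -- illustrative values, not our deployed config.
broker_heartbeat = 60                 # detect dead TCP connections faster
broker_connection_timeout = 30        # seconds to wait when (re)connecting
broker_connection_retry = True        # retry instead of staying wedged
broker_connection_max_retries = None  # None means retry forever
```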

Opened a dashboard to track those (I don't see the stacktrace reported there though, only the consumer connection error log) [1]

[1] http://kibana0.internal.softwareheritage.org:5601/app/kibana#/dashboard/ebe391f0-3290-11e9-b8ce-cf95f437ce37

olasd added a subscriber: olasd. Feb 19 2019, 9:55 AM

I've pushed an updated kombu to our repository.

I've done the upgrade on saatchi and restarted both listener and runner. I've removed the runner restart from the saatchi crontab.


Thank you!

ardumont changed the task status from Open to Work in Progress. Edited Feb 19 2019, 2:38 PM

WIP as the new version has been deployed (runner, listener, workers, etc.).
Let's see if the occurrences recur.

For now, they have not.

ardumont closed this task as Resolved. Feb 20 2019, 10:00 AM
ardumont claimed this task.

status:

  • runner: fine (no crash, no restart)
  • listener: fine (same)
  • workers: fine (same)

I think it's safe to close now.