Software Heritage

Connection problems with latest celery (4.2.1; docker env, production)
Closed, Resolved · Public

Description

When I start the docker environment and try to load the engine a bit with tasks, I get several celery/amqp-related connection errors:

  1. The swh-scheduler runner often (after a while, when inserting new tasks, while one or more workers are loading the system) falls into a failed state where it loops very quickly attempting to (re)make an amqp connection. When this occurs, the first exception is:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/kombu/connection.py", line 495, in _ensured
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/kombu/common.py", line 135, in _maybe_declare
    entity.declare(channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 605, in declare
    self._create_queue(nowait=nowait, channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 614, in _create_queue
    self.queue_declare(nowait=nowait, passive=False, channel=channel)
  File "/usr/local/lib/python3.6/site-packages/kombu/entity.py", line 649, in queue_declare
    nowait=nowait,
  File "/usr/local/lib/python3.6/site-packages/amqp/channel.py", line 1150, in queue_declare
    nowait, arguments),
  File "/usr/local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 51, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/usr/local/lib/python3.6/site-packages/amqp/method_framing.py", line 172, in write_frame
    write(view[:offset])
  File "/usr/local/lib/python3.6/site-packages/amqp/transport.py", line 282, in write
    self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer

then it loops very quickly with the same traceback, except that the last line becomes a

BrokenPipeError: [Errno 32] Broken pipe

When this error occurs, the runner is in a failed state (despite still being "running"): it cannot send tasks any more and must be restarted.

Note that these tracebacks are normally caught by kombu; I had to add logging statements in kombu.connection.Connection.ensure (in the _ensured wrapper) to be able to see them.
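Since kombu's retry machinery swallows these exceptions, extra logging is the only way to see them. The pattern can be illustrated with a self-contained sketch (this is not kombu's actual code, just the general "log recoverable errors before retrying" idea that the added logging statements implement):

```python
import logging
import traceback

logger = logging.getLogger("amqp.debug")


def ensure_with_logging(fun, max_retries=3,
                        recoverable=(ConnectionResetError, BrokenPipeError)):
    """Call fun(); on a recoverable connection error, log the full traceback
    before retrying, instead of silently swallowing it the way kombu's
    ensure()/_ensured wrapper does by default."""
    def wrapper(*args, **kwargs):
        last_exc = None
        for attempt in range(1, max_retries + 1):
            try:
                return fun(*args, **kwargs)
            except recoverable as exc:
                last_exc = exc
                logger.warning("attempt %d/%d failed:\n%s",
                               attempt, max_retries, traceback.format_exc())
        raise last_exc
    return wrapper
```

Wrapping the connection-using callable this way surfaces the ConnectionResetError/BrokenPipeError sequence described above without changing the retry behaviour.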

This problem looks like a bunch of reported issues upstream:

  2. I often have messages in the amqp logs like:
amqp_1    | =WARNING REPORT==== 22-Jan-2019::15:04:09 ===
amqp_1    | closing AMQP connection <0.10097.4795> (172.20.0.27:57486 -> 172.20.0.8:5672, vhost: '/', user: 'guest'):
amqp_1    | client unexpectedly closed TCP connection

These seem to result from a heavily loaded worker (the pypi loader in this case, with 4 workers running). Each time this occurs, a new amqp connection appears in the worker's log, but no traceback or error message.

  3. flower constantly logs ConnectionResetError: [Errno 104] Connection reset by peer, making it pretty unreliable.

These unexpectedly closed TCP connections might be related to one or more upstream issues:

For the first problem, I have tried with the git (master) version of:

  • amqp on swh-scheduler-runner: did not fix the problem
  • kombu on swh-scheduler-runner: for now it seems more stable (no connection problems have occurred since I installed kombu from the git repo).
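For reference, installing kombu from git master over the pinned release can be done with a one-liner along these lines (the URL here is the upstream repository as an assumption; the actual deployment may have used a different remote or packaging route):

```shell
# Replace the pinned kombu release with upstream git master.
# URL assumed to be the canonical upstream repo; adjust if using a fork.
pip install --upgrade "git+https://github.com/celery/kombu.git@master#egg=kombu"
```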

Overall, this gives the impression that celery 4 is pretty unreliable on a moderately loaded system (my machine's load is around 2-3 with the pypi loader working with 4 workers; the other workers are pretty idle)...

Event Timeline

douardda created this task. Jan 22 2019, 4:18 PM
vlorentz triaged this task as High priority.

I confirm that I no longer see ConnectionResetError: [Errno 104] Connection reset by peer or BrokenPipeError: [Errno 32] Broken pipe in the runner logs with kombu from git master.

ardumont added a subscriber: ardumont. Edited Feb 16 2019, 3:00 AM

As per our pair programming yesterday, I think we are now reproducing this in production (with the runner at least).

Related T1525

ardumont renamed this task from Connection problems with celery in the docker env to Connection problems with latest celery (4.2.1; docker env, production).

A bunch of celery workers (loader*, lister*) indeed have a ConnectionResetError stacktrace (not necessarily the same one):

Feb 17 06:36:19 worker01 python3[22539]: [2019-02-17 06:36:19,352: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
                                         Traceback (most recent call last):
                                           File "/usr/lib/python3/dist-packages/celery/worker/consumer/consumer.py", line 317, in start
                                             blueprint.start(self)
                                           File "/usr/lib/python3/dist-packages/celery/bootsteps.py", line 119, in start
                                             step.start(parent)
                                           File "/usr/lib/python3/dist-packages/celery/worker/consumer/consumer.py", line 593, in start
                                             c.loop(*c.loop_args())
                                           File "/usr/lib/python3/dist-packages/celery/worker/loops.py", line 91, in asynloop
                                             next(loop)
                                           File "/usr/lib/python3/dist-packages/kombu/asynchronous/hub.py", line 354, in create_loop
                                             cb(*cbargs)
                                           File "/usr/lib/python3/dist-packages/kombu/transport/base.py", line 236, in on_readable
                                             reader(loop)
                                           File "/usr/lib/python3/dist-packages/kombu/transport/base.py", line 218, in _read
                                             drain_events(timeout=0)
                                           File "/usr/lib/python3/dist-packages/amqp/connection.py", line 491, in drain_events
                                             while not self.blocking_read(timeout):
                                           File "/usr/lib/python3/dist-packages/amqp/connection.py", line 497, in blocking_read
                                             return self.on_inbound_frame(frame)
                                           File "/usr/lib/python3/dist-packages/amqp/method_framing.py", line 55, in on_frame
                                             callback(channel, method_sig, buf, None)
                                           File "/usr/lib/python3/dist-packages/amqp/connection.py", line 501, in on_inbound_method
                                             method_sig, payload, content,
                                           File "/usr/lib/python3/dist-packages/amqp/abstract_channel.py", line 128, in dispatch_method
                                             listener(*args)
                                           File "/usr/lib/python3/dist-packages/amqp/connection.py", line 621, in _on_close
                                             self._x_close_ok()
                                           File "/usr/lib/python3/dist-packages/amqp/connection.py", line 636, in _x_close_ok
                                             self.send_method(spec.Connection.CloseOk, callback=self._on_close_ok)
                                           File "/usr/lib/python3/dist-packages/amqp/abstract_channel.py", line 50, in send_method
                                             conn.frame_writer(1, self.channel_id, sig, args, content)
                                           File "/usr/lib/python3/dist-packages/amqp/method_framing.py", line 166, in write_frame
                                             write(view[:offset])
                                           File "/usr/lib/python3/dist-packages/amqp/transport.py", line 275, in write
                                             self._write(s)
                                         ConnectionResetError: [Errno 104] Connection reset by peer

Then they cannot reopen the connection to rabbitmq (it seems a restart is in order).
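A mitigation worth noting for this wedged-worker behaviour: Celery 4 exposes broker connection settings controlling heartbeats and reconnection attempts. A hedged sketch of a celeryconfig fragment (the setting names are standard Celery 4.x options; the values are illustrative only, not what was deployed on our workers):

```python
# celeryconfig.py fragment -- illustrative values, not our deployed config.
broker_heartbeat = 60                 # detect dead TCP connections faster
broker_connection_timeout = 30        # seconds to wait when (re)connecting
broker_connection_retry = True        # retry instead of staying wedged
broker_connection_max_retries = None  # None means retry forever
```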

Opened a dashboard to track those (I don't see the stacktrace reported there though, only the consumer connection error log) [1]

[1] http://kibana0.internal.softwareheritage.org:5601/app/kibana#/dashboard/ebe391f0-3290-11e9-b8ce-cf95f437ce37

olasd added a subscriber: olasd. Feb 19 2019, 9:55 AM

I've pushed an updated kombu to our repository.

I've done the upgrade on saatchi and restarted both listener and runner. I've removed the runner restart from the saatchi crontab.


Thank you!

ardumont changed the task status from Open to Work in Progress. Edited Feb 19 2019, 2:38 PM

WIP as the new version has been deployed (runner, listener, workers, etc.).
Let's see if the occurrences recur.

For now, they have not.

ardumont closed this task as Resolved. Feb 20 2019, 10:00 AM
ardumont claimed this task.

status:

  • runner: fine (no crash, no restart)
  • listener: fine (same)
  • workers: fine (same)

I think it's safe to close now.