
elastic-workers: Give the loader some time to finish gracefully
Changes Planned · Public

Authored by ardumont on May 10 2022, 6:55 PM.

Diff Detail

Repository
rDSNIP Code snippets
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 29270
Build 45761: arc lint + arc unit

Event Timeline

What component is providing the health http endpoint in the pod?

Would it be possible to set the health check to use a celery remote control command (e.g. `celery -A swh.scheduler.celery_backend.config.app inspect -d $worker_celery_name active_queues`) instead?

As for worker termination, I guess it would also make sense to use a custom command which 1/ cancels the active queues (using `celery control cancel_consumer $queue`); 2/ monitors the active tasks (using `celery inspect active`, in a loop); 3/ forcefully terminates them after a given timeout

(basically, the liveness probe could do what our current swh-worker-ping-restart script does: call `swh scheduler celery-monitor --pattern "$celery_name" ping-workers` a couple of times. We just need to figure out how to generate the `$celery_name` value consistently between the entrypoint and the liveness probe)

> What component is providing the health http endpoint in the pod?

That'd be the container probe [1] declaration; in my diff, that's the livenessProbe
declaration. The node agent (the 'kubelet' running on every node) is the one triggering
that check (an HTTP GET here).

[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
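
For illustration, a minimal sketch of such a livenessProbe declaration (the image name, path, port and timings below are placeholders, not the values from this diff):

```yaml
# Minimal livenessProbe sketch: the kubelet on the node issues the HTTP GET
# and restarts the container after `failureThreshold` consecutive failures.
# Image, path, port and timings are illustrative placeholders.
containers:
  - name: loader
    image: softwareheritage/loader:latest   # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz    # assumed health endpoint
        port: 8080        # assumed port
      initialDelaySeconds: 30
      periodSeconds: 60
      failureThreshold: 3
```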

> Would it be possible to set the health check to use a celery remote control command
> (e.g. `celery -A swh.scheduler.celery_backend.config.app inspect -d $worker_celery_name active_queues`) instead?

From the doc linked above, that looks possible through the `exec` declaration
(instead of `httpGet`).
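
Roughly, that could look like the sketch below (the worker name format is an assumption; it would have to match whatever name the entrypoint registers with celery):

```yaml
# exec-based probe sketch: the kubelet runs the command inside the container
# and treats a non-zero exit code as a probe failure.
# The worker name "loader@$(hostname)" is a guess, not the actual naming scheme.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - >-
        celery -A swh.scheduler.celery_backend.config.app
        inspect -d "loader@$(hostname)" active_queues
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 30
```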

> As for worker termination, I guess it would also make sense to use a custom command
> which 1/ cancels the active queues (using `celery control cancel_consumer $queue`);
> 2/ monitors the active tasks (using `celery inspect active`, in a loop); 3/ forcefully
> terminates them after a given timeout

I'm not totally clear on that bit yet (especially regarding the impact on other
consumers). For the 3/ part though, I opened a task a while back about shortening the
loading in some way [2]. That sounds like something that can be done independently from
this (as it may be a somewhat heavy endeavor).

[2] T3640
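
For the record, one possible wiring of steps 1/ to 3/ as a Kubernetes preStop hook; everything below (worker name, queue name, timeouts) is a placeholder, not what this diff implements:

```yaml
# Graceful-drain sketch: stop consuming, wait for in-flight tasks, then give up.
# terminationGracePeriodSeconds must cover the drain loop, otherwise the kubelet
# sends SIGKILL before the loop finishes.
spec:
  terminationGracePeriodSeconds: 3900
  containers:
    - name: loader
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - |
                APP=swh.scheduler.celery_backend.config.app
                WORKER="loader@$(hostname)"                     # placeholder worker name
                QUEUE=swh.loader.git.tasks.UpdateGitRepository  # example queue
                # 1/ stop consuming new tasks from that queue
                celery -A "$APP" control cancel_consumer "$QUEUE" -d "$WORKER"
                # 2/ poll the active tasks; `celery inspect active` reports
                #    '- empty -' once nothing is running on the worker
                for _ in $(seq 1 60); do
                  celery -A "$APP" inspect -d "$WORKER" active | grep -q empty && exit 0
                  sleep 60
                done
                # 3/ give up after ~1h and let kubernetes terminate the pod
                exit 0
```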

> (basically, the liveness probe could do what our current swh-worker-ping-restart
> script does: call `swh scheduler celery-monitor --pattern "$celery_name" ping-workers`
> a couple of times. We just need to figure out how to generate the `$celery_name` value
> consistently between the entrypoint and the liveness probe)

Interesting, thx.
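
A liveness probe along those lines could look like the following; how the pattern gets into the container (the CELERY_NAME environment variable here) is exactly the open question above, so treat that part as a placeholder:

```yaml
# Sketch of a probe reusing the swh-worker-ping-restart logic. CELERY_NAME is
# an assumed environment variable; it would have to be set consistently by the
# entrypoint and this probe.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - swh scheduler celery-monitor --pattern "$CELERY_NAME" ping-workers
  periodSeconds: 120
  timeoutSeconds: 60
  failureThreshold: 2
```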

I expect that the clean celery worker termination command would also be useful for the current prod setup.

In any case, my current implementation is not working ¯\_(ツ)_/¯.
It somehow breaks the restart of the pods (taking that commit out, everything gets back to normal).