Details
- Reviewers: None
- Group Reviewers: Reviewers
- Maniphest Tasks: T4144: Elastic worker infrastructure
Diff Detail
- Repository: rDSNIP Code snippets
- Branch: master
- Lint: No Linters Available
- Unit: No Unit Test Coverage
- Build Status: Buildable 29270, Build 45761: arc lint + arc unit
Event Timeline
What component is providing the health http endpoint in the pod?
Would it be possible to set the health check to use a celery remote control command (e.g. `celery -A swh.scheduler.celery_backend.config.app inspect -d $worker_celery_name active_queues`) instead?
As for worker termination, I guess it would also make sense to use a custom command which 1/ cancels the active queues (using `celery control cancel_consumer $queue`); 2/ monitors the active tasks (using `celery inspect active`, in a loop); 3/ forcefully terminates them after a given timeout.
(basically, the liveness probe could do what our current swh-worker-ping-restart script does: call `swh scheduler celery-monitor --pattern "$celery_name" ping-workers` a couple of times. We just need to figure out how to generate the $celery_name value consistently between the entrypoint and the liveness probe)
> What component is providing the health http endpoint in the pod?
That'd be the container probe [1] declaration, in my diff the `livenessProbe` declaration. The node agent (the `kubelet` running on all nodes) is the one triggering that check (http here).
[1] https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
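For illustration, a minimal sketch of what such an http-based probe could look like in the worker container spec; the path, port and timings below are placeholders, not necessarily what this diff declares:
```yaml
# Hypothetical container spec excerpt: the kubelet periodically issues the
# HTTP GET and restarts the container after `failureThreshold` failures.
livenessProbe:
  httpGet:
    path: /healthz        # placeholder path, not necessarily the endpoint used here
    port: 8080            # placeholder port
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3
```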
> Would it be possible to set the health check to use a celery remote control command
> (e.g. `celery -A swh.scheduler.celery_backend.config.app inspect -d
> $worker_celery_name active_queues`) instead?
From the linked doc, it seems possible through the `exec` declaration (instead of `httpGet`).
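A rough sketch of what that could look like, assuming the worker name is available inside the container as a `$worker_celery_name` environment variable (an assumption) and that the celery command exits non-zero when the worker does not reply:
```yaml
# Hypothetical exec-based probe: the kubelet runs the command inside the
# container; a non-zero exit code counts as a probe failure.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - >
        celery -A swh.scheduler.celery_backend.config.app
        inspect -d "$worker_celery_name" active_queues
  timeoutSeconds: 30
  periodSeconds: 60
```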
> As for worker termination, I guess it would also make sense to use a custom command
> which 1/ cancels the active queues (using `celery control cancel_consumer $queue`); 2/
> monitors the active tasks (using `celery inspect active`, in a loop); 3/ forcefully
> terminates them after a given timeout
I'm not totally clear on that bit yet (especially regarding the impact on other consumers).
For the 3/ part though, I opened a task a while back about shortening the loading in some ways [2]. That sounds like something that can be done independently of this though (as it may be a somewhat heavy endeavor).
[2] T3640
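If that route is explored, one natural place to hook such a command in Kubernetes would be a `preStop` lifecycle hook. This is only a sketch: the `$queue` and `$worker_celery_name` variables, the exact option ordering of the celery commands, and the output check are all assumptions, and `terminationGracePeriodSeconds` would need to be at least as long as the polling loop:
```yaml
# Hypothetical graceful-shutdown hook mirroring the 1/ 2/ 3/ steps above.
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # 1/ stop consuming from the queue ($queue and $worker_celery_name are
          #    assumed to be set in the container environment)
          celery -A swh.scheduler.celery_backend.config.app \
            control -d "$worker_celery_name" cancel_consumer "$queue"
          # 2/ poll the active tasks until the worker reports none (rough check
          #    on the "empty" marker in the inspect output)
          for i in $(seq 1 30); do
            celery -A swh.scheduler.celery_backend.config.app \
              inspect -d "$worker_celery_name" active | grep -q 'empty' && exit 0
            sleep 10
          done
          # 3/ give up after ~5 minutes; Kubernetes then sends SIGTERM and, at
          #    the end of the grace period, SIGKILL
          exit 0
```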
> (basically, the liveness probe could do what our current swh-worker-ping-restart
> script does: call `swh scheduler celery-monitor --pattern "$celery_name" ping-workers`
> a couple of times. We just need to figure out how to generate the $celery_name value
> consistently between the entrypoint and the liveness probe)
Interesting, thx.
I expect that the clean celery worker termination command would also be useful for the current prod setup.
In any case, my current implementation is not working ¯\_(ツ)_/¯.
It somehow makes the pod restarts fail (dropping that commit and everything gets back to normal).
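As a note on the $celery_name question above, a sketch of what the ping-workers based probe could look like; the hostname-based derivation below is purely hypothetical and would have to match whatever naming scheme the entrypoint actually passes to celery:
```yaml
# Hypothetical liveness probe reusing the swh-worker-ping-restart logic.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        # Assumption: the entrypoint derives the worker name from the pod
        # hostname, so the probe can rebuild the same value here.
        celery_name="$(hostname)"
        swh scheduler celery-monitor --pattern "$celery_name" ping-workers
  periodSeconds: 60
  failureThreshold: 3
```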