The way we initialize the opam shared root on workers, through multiple timer units, is pretty brittle. A lot of the
services actually fail to run because of a race condition or because of "lack of a switch", whatever that means.
Jan 03 00:00:00 worker01 systemd[1]: Started Software Heritage Manage OPAM shared state (coq.inria.fr). Jan 03 00:00:01 worker01 opam-manage-shared-state.sh[1475377]: [WARNING] No switch is currently set, perhaps you meant '--set-default'? Jan 03 00:00:02 worker01 opam-manage-shared-state.sh[1475377]: [coq.inria.fr] Initialised Jan 03 00:00:02 worker01 opam-manage-shared-state.sh[1475377]: [ERROR] No switch is currently set. Please use 'opam switch' to set or install a switch Jan 03 00:00:02 worker01 systemd[1]: opam-manage-shared-state-coq.inria.fr.service: Main process exited, code=exited, status=50/n/a Jan 03 00:00:02 worker01 systemd[1]: opam-manage-shared-state-coq.inria.fr.service: Failed with result 'exit-code'.
Instead of having separate timer units, we should probably have a single one which would run a script that would create the default root first, then run snippets updating the created root for each separate instance afterwards.
- use set -e in the main script if it's not already set
- make the update script update all instances at once, instead of a single one separately: write a main script which handles the creation of the opam root and the update of the main instance, and run snippets generated for each other instance, using run-parts
- make sure the worker only starts after a successful run of the main service
- make sure the timer unit runs at different times, rather than all at midnight
- consider moving the shared root to /var/tmp to avoid it being blown away by reboots (probably not needed if we make sure the service dependencies are correct)