
Make opam shared root initialization more robust
Closed, Migrated · Edits Locked

Description

The way we initialize the opam shared root on workers, through multiple timer units, is pretty brittle. A lot of the
services actually fail to run because of a race condition or because of "lack of a switch", whatever that means.

Jan 03 00:00:00 worker01 systemd[1]: Started Software Heritage Manage OPAM shared state (coq.inria.fr).
Jan 03 00:00:01 worker01 opam-manage-shared-state.sh[1475377]: [WARNING] No switch is currently set, perhaps you meant '--set-default'?
Jan 03 00:00:02 worker01 opam-manage-shared-state.sh[1475377]: [coq.inria.fr] Initialised
Jan 03 00:00:02 worker01 opam-manage-shared-state.sh[1475377]: [ERROR] No switch is currently set. Please use 'opam switch' to set or install a switch
Jan 03 00:00:02 worker01 systemd[1]: opam-manage-shared-state-coq.inria.fr.service: Main process exited, code=exited, status=50/n/a
Jan 03 00:00:02 worker01 systemd[1]: opam-manage-shared-state-coq.inria.fr.service: Failed with result 'exit-code'.

Instead of having separate timer units, we should probably have a single one running a script that creates the default root first, then runs snippets updating the created root for each separate instance.

  • use set -e in the main script if it's not already set
  • make the update script update all instances at once instead of one at a time: write a main script that creates the opam root and updates the main instance, then runs snippets generated for each other instance via run-parts (a sketch follows this list)
  • make sure the worker only starts after a successful run of the main service
  • make sure the timer unit runs at different times, rather than all at midnight
  • consider moving the shared root to /var/tmp to avoid it being blown away by reboots (probably not needed if we make sure the service dependencies are correct)
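
A minimal sketch of what this could look like, assuming a /tmp/opam shared root, an /etc/opam-shared-state.d snippet directory and a single opam-manage-shared-state service/timer pair; none of these names or paths are taken from the actual deployment:

#!/bin/bash
# Single entry point replacing the per-instance timer units (sketch).
set -e

OPAM_ROOT=/tmp/opam                   # hypothetical shared root location
SNIPPET_DIR=/etc/opam-shared-state.d  # hypothetical per-instance snippets

# Create the shared root once, without a switch, so later runs don't race.
if [ ! -f "$OPAM_ROOT/config" ]; then
    opam init --root "$OPAM_ROOT" --bare --no-setup default https://opam.ocaml.org
fi

# Update the main instance.
opam update --root "$OPAM_ROOT" --all

# Then run the generated snippets for the other instances, sequentially.
# Note: run-parts skips file names containing dots, so snippets for e.g.
# coq.inria.fr need sanitised names (or run-parts --regex).
run-parts --exit-on-error --verbose "$SNIPPET_DIR"

A single service/timer pair could then spread runs across workers instead of firing at midnight everywhere:

# opam-manage-shared-state.service (sketch)
[Unit]
Description=Manage the shared opam root (all instances)

[Service]
Type=oneshot
ExecStart=/usr/local/bin/opam-manage-shared-state.sh

# opam-manage-shared-state.timer (sketch)
[Timer]
OnCalendar=daily
RandomizedDelaySec=6h
Persistent=true

[Install]
WantedBy=timers.target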

Event Timeline

olasd triaged this task as High priority. Jan 3 2022, 2:15 PM
olasd created this task.

Deployed. And the alerts are now gone [1]

> consider moving the shared root to /var/tmp to avoid it being blown away by reboots (probably not needed if we make sure the service dependencies are correct)

I did not go the path-change route, since that impacts current code in both the lister and the loader (which are running with the default fallback values).
And the systemd dependencies are correct, as per our iteration on the diff.
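
For reference, "correct dependencies" here boils down to an ordering along these lines, e.g. as a drop-in for the worker unit; the unit names are assumptions, not the ones actually deployed:

# /etc/systemd/system/swh-worker@.service.d/opam-shared-state.conf (sketch)
[Unit]
# Only start the worker once the shared opam root has been set up successfully.
Requires=opam-manage-shared-state.service
After=opam-manage-shared-state.service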

[1]

09:25 <+swhbot> icinga RECOVERY: service check_systemd on worker0.internal.staging.swh.network is OK: SYSTEMD OK - all
09:25 <+swhbot> icinga RECOVERY: service check_systemd on worker2.internal.staging.swh.network is OK: SYSTEMD OK - all
09:25 <+swhbot> icinga RECOVERY: service check_systemd on worker1.internal.staging.swh.network is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker07.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker10.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker16.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker01.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker13.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker12.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker08.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker06.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker09.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker04.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker14.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker02.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker11.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker03.softwareheritage.org is OK: SYSTEMD OK - all
09:27 <+ardumont> good workers!
09:27 <+swhbot> icinga RECOVERY: service check_systemd on worker05.softwareheritage.org is OK: SYSTEMD OK - all
09:27 <+swhbot> icinga RECOVERY: service check_systemd on worker15.softwareheritage.org is OK: SYSTEMD OK - all
ardumont changed the task status from Open to Work in Progress. Jan 7 2022, 9:30 AM
ardumont moved this task from Backlog to in-progress on the System administration board.
ardumont claimed this task.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.