
Make opam shared root initialization more robust
Closed, Migrated · Edits Locked

Description

The way we initialize the opam shared root on workers, through multiple timer units, is pretty brittle. A lot of the
services actually fail to run because of a race condition or because of "lack of a switch", whatever that means.

Jan 03 00:00:00 worker01 systemd[1]: Started Software Heritage Manage OPAM shared state (coq.inria.fr).
Jan 03 00:00:01 worker01 opam-manage-shared-state.sh[1475377]: [WARNING] No switch is currently set, perhaps you meant '--set-default'?
Jan 03 00:00:02 worker01 opam-manage-shared-state.sh[1475377]: [coq.inria.fr] Initialised
Jan 03 00:00:02 worker01 opam-manage-shared-state.sh[1475377]: [ERROR] No switch is currently set. Please use 'opam switch' to set or install a switch
Jan 03 00:00:02 worker01 systemd[1]: opam-manage-shared-state-coq.inria.fr.service: Main process exited, code=exited, status=50/n/a
Jan 03 00:00:02 worker01 systemd[1]: opam-manage-shared-state-coq.inria.fr.service: Failed with result 'exit-code'.

Instead of having separate timer units, we should probably have a single one running a script that creates the default root first, then runs snippets updating the created root for each separate instance.

  • use set -e in the main script if it's not already set
  • make the update script update all instances at once instead of one at a time: write a main script that creates the opam root and updates the main instance, then runs snippets generated for each other instance via run-parts (a sketch follows this list)
  • make sure the worker only starts after a successful run of the main service
  • make sure the timer unit runs at different times, rather than all at midnight
  • consider moving the shared root to /var/tmp to avoid it being blown away by reboots (probably not needed if we make sure the service dependencies are correct)
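
A minimal sketch of what this could look like, assuming a /tmp/opam shared root, an /etc/opam-shared-state.d snippet directory and a single opam-manage-shared-state service/timer pair; none of these names or paths are taken from the actual deployment:

#!/bin/bash
# Single entry point replacing the per-instance timer units (sketch).
set -e

OPAM_ROOT=/tmp/opam                   # hypothetical shared root location
SNIPPET_DIR=/etc/opam-shared-state.d  # hypothetical per-instance snippets

# Create the shared root once, without a switch, so later runs don't race.
if [ ! -f "$OPAM_ROOT/config" ]; then
    opam init --root "$OPAM_ROOT" --bare --no-setup default https://opam.ocaml.org
fi

# Update the main instance.
opam update --root "$OPAM_ROOT" --all

# Then run the generated snippets for the other instances, sequentially.
# Note: run-parts skips file names containing dots, so snippets for e.g.
# coq.inria.fr need sanitised names (or run-parts --regex).
run-parts --exit-on-error --verbose "$SNIPPET_DIR"

A single service/timer pair could then spread runs across workers instead of firing at midnight everywhere:

# opam-manage-shared-state.service (sketch)
[Unit]
Description=Manage the shared opam root (all instances)

[Service]
Type=oneshot
ExecStart=/usr/local/bin/opam-manage-shared-state.sh

# opam-manage-shared-state.timer (sketch)
[Timer]
OnCalendar=daily
RandomizedDelaySec=6h
Persistent=true

[Install]
WantedBy=timers.target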

Event Timeline

olasd triaged this task as High priority. Jan 3 2022, 2:15 PM
olasd created this task.

Deployed. And the alerts are now gone [1]

> consider moving the shared root to /var/tmp to avoid it being blown away by reboots (probably not needed if we make sure the service dependencies are correct)

I did not go the path-change route, since that impacts current code in both the lister and the loader (which are running with the default fallback values).
And the systemd dependencies are correct, as per our iteration on the diff.
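
For reference, "correct dependencies" here boils down to an ordering along these lines, e.g. as a drop-in for the worker unit; the unit names are assumptions, not the ones actually deployed:

# /etc/systemd/system/swh-worker@.service.d/opam-shared-state.conf (sketch)
[Unit]
# Only start the worker once the shared opam root has been set up successfully.
Requires=opam-manage-shared-state.service
After=opam-manage-shared-state.service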

[1]

09:25 <+swhbot> icinga RECOVERY: service check_systemd on worker0.internal.staging.swh.network is OK: SYSTEMD OK - all
09:25 <+swhbot> icinga RECOVERY: service check_systemd on worker2.internal.staging.swh.network is OK: SYSTEMD OK - all
09:25 <+swhbot> icinga RECOVERY: service check_systemd on worker1.internal.staging.swh.network is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker07.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker10.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker16.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker01.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker13.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker12.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker08.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker06.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker09.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker04.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker14.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker02.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker11.softwareheritage.org is OK: SYSTEMD OK - all
09:26 <+swhbot> icinga RECOVERY: service check_systemd on worker03.softwareheritage.org is OK: SYSTEMD OK - all
09:27 <+ardumont> good workers!
09:27 <+swhbot> icinga RECOVERY: service check_systemd on worker05.softwareheritage.org is OK: SYSTEMD OK - all
09:27 <+swhbot> icinga RECOVERY: service check_systemd on worker15.softwareheritage.org is OK: SYSTEMD OK - all
ardumont changed the task status from Open to Work in Progress. Jan 7 2022, 9:30 AM
ardumont moved this task from Backlog to in-progress on the System administration board.
ardumont claimed this task.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.