Reliable monitoring of services: for users and for admins
Started, Work in Progress, Normal, Public

Description

This meta-task tracks activities geared towards building reliable monitoring indicators of the Software Heritage services, for our users and for our own admins. Every service disruption should be tracked clearly, so that we avoid messages on IRC saying "oh, it's normal that XXX does not work, we did YYY some weeks ago".

This involves:

  • ensuring that status.softwareheritage.org always faithfully represents the operational status of the infrastructure
  • refining, if necessary, the list of services on status.softwareheritage.org
  • clearly announcing scheduled downtime and changes to APIs, the WebApp or any other user-visible feature
  • adding missing monitoring points as needed

On the admin side, we also need to clearly identify the key indicators we want to follow, and reduce noise on them: these indicators should all be green during normal operation, and only show alerts that are meaningful and require intervention.

Event Timeline

vlorentz triaged this task as Normal priority. Mar 15 2021, 12:29 PM
vsellier changed the task status from Open to Work in Progress. May 20 2021, 12:01 PM

From the status.swh.org point of view, status.io provides API endpoints to push metrics. It should be possible to add some metrics (up to 10 with our plan) to expose the behavior of the platform (daily, weekly and monthly statistics).
As a first step, we could expose the number of pending save code now requests and the number of origin visits, to have some live data. An example of a status page with metrics: https://status.docker.com/
I'm working on a code snippet to test the feasibility and complexity of the integration.
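
To make the idea concrete, here is a minimal sketch of what such a push could look like. It is not the actual snippet: the endpoint URL, the `x-api-id`/`x-api-key` headers, the `day_*` field names and the millisecond timestamp are assumptions about the status.io metric API and should be checked against its documentation. Only the daily series is built; the weekly and monthly ones would follow the same pattern.

```python
import json
import urllib.request

# Assumption: endpoint and field names as recalled from the status.io API;
# verify against the official documentation before use.
API_URL = "https://api.status.io/v2/metric/update"


def build_payload(statuspage_id, metric_id, samples):
    """Build a daily metric update from (datetime, value) samples."""
    samples = sorted(samples, key=lambda sample: sample[0])
    values = [value for _, value in samples]
    return {
        "statuspage_id": statuspage_id,
        "metric_id": metric_id,
        "day_avg": sum(values) / len(values) if values else 0,
        # Assumption: the start of the series is sent in epoch milliseconds.
        "day_start": int(samples[0][0].timestamp() * 1000) if samples else 0,
        "day_dates": [ts.isoformat() for ts, _ in samples],
        "day_values": values,
    }


def push(payload, api_id, api_key):
    """POST the payload, passing the credentials as headers (assumed names)."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "x-api-id": api_id,
            "x-api-key": api_key,
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

The samples themselves (for instance the number of pending save code now requests) would come from a query on our side; that part is left out here.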

For the user-facing services, the vault service is not exposed on the status page.
We could also add an integration with uptimerobot (free for fewer than 50 probes) to automatically open an incident when some endpoints are not responding, but it should be done carefully to avoid creating false incidents.
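
One possible way to limit false incidents, sketched below, is to open an incident only after several consecutive failed probes for the same endpoint. The `IncidentGate` class and the threshold of 3 are hypothetical and independent of any uptimerobot or status.io feature.

```python
from collections import defaultdict


class IncidentGate:
    """Signal an incident only after `threshold` consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = defaultdict(int)

    def record(self, endpoint, ok):
        """Record one probe result; return True when an incident should be opened."""
        if ok:
            # Any successful probe resets the failure streak.
            self.failures[endpoint] = 0
            return False
        self.failures[endpoint] += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.failures[endpoint] == self.threshold
```

With this approach a single timeout never opens an incident, at the cost of a detection delay of `threshold` probe intervals.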

To communicate the impact of public-facing changes, perhaps we should revive the dedicated page in the documentation, https://docs.softwareheritage.org/devel/archive-changelog.html, which is currently quite outdated.

I have also been thinking for some time about a monitoring probe that checks the version of the swh packages installed on the servers, creates a new grafana annotation when a new version is detected, and raises an alert if some servers are outdated.
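
The comparison part of such a probe could look like the sketch below. It assumes that each server reports the installed version of a package as a plain dotted numeric string (real Debian version strings would need a proper parser), and the host names in the usage note are made up. Collecting the versions, creating the grafana annotation and raising the alert are left out.

```python
def parse_version(version):
    """Turn "0.10.2" into (0, 10, 2) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def outdated_hosts(installed):
    """Return the hosts whose version is older than the newest one seen.

    `installed` maps a host name to the version installed on that host.
    """
    if not installed:
        return {}
    newest = max(parse_version(version) for version in installed.values())
    return {
        host: version
        for host, version in installed.items()
        if parse_version(version) < newest
    }
```

For example, `outdated_hosts({"worker01": "0.10.2", "worker02": "0.9.12"})` reports only `worker02`, since 0.9.12 is older than 0.10.2 once the components are compared as numbers.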

Metrics can easily be pushed to the status page.
A simple PoC for the save code now requests is available here: https://forge.softwareheritage.org/source/snippets/browse/master/sysadmin/status.io/update_metrics.py

The result is displayed on the status page:

  • daily stats
  • weekly stats
  • monthly stats

The save code now queue statistics are now displayed on the status.io page[1] as an example. The data is refreshed every 5 minutes.

[1] https://status.softwareheritage.org/