
Reliable monitoring of services: for users and for admins
Closed, Migrated


This meta-task tracks activities geared towards building reliable monitoring indicators of the Software Heritage services, both for our users and for our own admins. Every service disruption should be tracked clearly, avoiding messages on IRC like "oh, it's normal that XXX does not work, we did YYY some weeks ago".

This involves:

  • ensuring that the status page always faithfully represents the operational status of the infrastructure
  • refining, if necessary, the list of services shown on the status page
  • clearly announcing, in advance, scheduled downtime or changes to the APIs/WebApp or any other user-visible feature
  • adding missing monitoring points as needed

On the admin side, we also need to clearly identify the key indicators we want to follow, and to reduce the noise on them: these indicators should all be green during normal operation, and should only raise alerts that are meaningful and require intervention.

Event Timeline

vlorentz triaged this task as Normal priority.Mar 15 2021, 12:29 PM
vsellier changed the task status from Open to Work in Progress.May 20 2021, 12:01 PM

From the status page point of view, the provider exposes an API endpoint to push metrics. It should be possible to add some metrics (up to 10 with our plan) to expose the behavior of the platform (daily, weekly, and monthly statistics).
As a first step, we could expose the number of pending "save code now" requests and the number of origin visits, to have some live data. An example of a status page with metrics:
I'm working on a code snippet to test the feasibility/complexity of the integration.
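As a rough sketch of what the snippet could look like (standard library only): the Statuspage v1 API accepts metric data points via `POST /pages/{page_id}/metrics/{metric_id}/data`, authenticated with an `Authorization: OAuth <key>` header. The page ID, metric ID, and API key below are placeholders, not real values:

```python
import json
import time
import urllib.request

API_BASE = "https://api.statuspage.io/v1"
PAGE_ID = "PAGE_ID"      # placeholder: from the status page admin UI
METRIC_ID = "METRIC_ID"  # placeholder: one of our (up to 10) custom metrics
API_KEY = "API_KEY"      # placeholder: the account API key

def build_datapoint(value, timestamp=None):
    """Payload shape expected by POST /pages/{id}/metrics/{id}/data."""
    ts = int(timestamp if timestamp is not None else time.time())
    return {"data": {"timestamp": ts, "value": value}}

def push_metric(value):
    """Push one data point, e.g. the number of pending save code now requests."""
    req = urllib.request.Request(
        f"{API_BASE}/pages/{PAGE_ID}/metrics/{METRIC_ID}/data",
        data=json.dumps(build_datapoint(value)).encode(),
        headers={"Authorization": f"OAuth {API_KEY}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

A cron job or systemd timer could then call `push_metric()` every few minutes with the current queue size.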

For the user-facing services, the vault service is not exposed on the status page.
We could also add an integration with UptimeRobot (free for fewer than 50 probes) to automatically open an incident when some endpoints stop responding, but it should be done carefully to avoid creating spurious incidents.
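For reference, registering such a probe would look roughly like this against the UptimeRobot v2 API (`POST /v2/newMonitor`, form-encoded; `type=1` is an HTTP(s) monitor). The API key and monitored URL below are placeholders:

```python
import urllib.parse
import urllib.request

UPTIMEROBOT_API = "https://api.uptimerobot.com/v2/newMonitor"
API_KEY = "API_KEY"  # placeholder: the UptimeRobot account API key

def build_monitor_params(name, url):
    """Form parameters for the v2 newMonitor call (type=1: HTTP(s) check)."""
    return {"api_key": API_KEY, "format": "json", "type": "1",
            "friendly_name": name, "url": url}

def create_monitor(name, url):
    """Register a new HTTP(s) probe for the given endpoint."""
    data = urllib.parse.urlencode(build_monitor_params(name, url)).encode()
    with urllib.request.urlopen(UPTIMEROBOT_API, data=data, timeout=10) as resp:
        return resp.read()
```

To limit spurious incidents, the probes should target stable endpoints (e.g. a health-check URL) rather than pages whose behavior changes with deployments.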

For the impact of public-facing changes, perhaps we should revive the dedicated page in the documentation, which is currently quite outdated.

For some time I've also been thinking about a monitoring probe checking the versions of the swh packages installed on the servers, creating a new grafana annotation when a new version is detected, and raising an alert if some servers are outdated.
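A sketch of that probe's two building blocks, assuming the standard grafana annotations endpoint (`POST /api/annotations` with a Bearer token). The grafana URL and token are placeholders, and collecting the installed versions themselves is left out:

```python
import json
import time
import urllib.request

GRAFANA_URL = "https://grafana.example.org"  # placeholder
GRAFANA_TOKEN = "API_TOKEN"                  # placeholder

def detect_new_versions(previous, current):
    """Return {package: (old, new)} for every package whose installed
    version changed since the previous probe run."""
    return {pkg: (previous.get(pkg), version)
            for pkg, version in current.items()
            if previous.get(pkg) != version}

def annotate_upgrade(pkg, old, new):
    """Create a grafana annotation through POST /api/annotations."""
    body = {
        "text": f"{pkg} upgraded: {old} -> {new}",
        "tags": ["swh", "deployment"],
        "time": int(time.time() * 1000),  # grafana expects epoch milliseconds
    }
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/annotations",
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {GRAFANA_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

The "outdated server" alert would then compare each server's reported versions against the newest version seen across the fleet.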

Metrics can easily be pushed to the status page.
The simple PoC for the save code now requests is available here:

The result is displayed on the status page:

  • daily stats :

  • weekly stats :

  • monthly stats :

The save code now queue statistics are now displayed on the page [1] as an example. The data is refreshed every 5 minutes.


Current status:
Following the last discussions, the approach I'm currently trying to implement is to create a grafana dashboard displaying the current status of the infrastructure.
To do so, some information managed by icinga should be displayed, like the status of the end-to-end checks.

Several options are currently identified:

  1. create an icinga datasource for grafana [3]
  2. send icinga stats to prometheus [1]
  3. export icinga stats to an influxdb database [2] and use them in grafana

[1] and

Option 2/ doesn't seem to be a viable solution, as it only covers performance statistics, which is probably not enough for our needs.
Options 1/ and 3/ look possible, but both introduce some complexity in the infra (1/: a maintenance and compatibility risk when the icinga/grafana versions change; 3/: a new component to operate on the infra, storing information duplicated between icinga itself and the prometheus exporters' data).

Perhaps the first step is to mock up the dashboard, to identify the probes we want to display and to check how they can be integrated. For example, exporting the data to prometheus directly from the end-to-end checks could be possible.
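For instance, each end-to-end check could write its result in the Prometheus text exposition format and let node_exporter's textfile collector scrape it, without adding any new component. A minimal sketch (the metric names are illustrative, not existing swh metrics):

```python
import os

def render_check_metric(check, success, duration):
    """Render one end-to-end check result in the Prometheus text
    exposition format (one gauge for status, one for duration)."""
    lines = [
        "# HELP swh_e2e_check_success 1 if the end-to-end check passed, 0 otherwise",
        "# TYPE swh_e2e_check_success gauge",
        f'swh_e2e_check_success{{check="{check}"}} {1 if success else 0}',
        "# HELP swh_e2e_check_duration_seconds Duration of the last check run",
        "# TYPE swh_e2e_check_duration_seconds gauge",
        f'swh_e2e_check_duration_seconds{{check="{check}"}} {duration:.3f}',
    ]
    return "\n".join(lines) + "\n"

def write_textfile(path, content):
    """Write atomically so node_exporter never reads a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(content)
    os.replace(tmp, path)
```

The check would end with something like `write_textfile("/var/lib/prometheus/node-exporter/swh_e2e.prom", render_check_metric("vault_cook", ok, elapsed))`, and the dashboard would simply graph the resulting gauges.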

bchauvet raised the priority of this task from Normal to High.Mar 25 2022, 5:26 PM