Reliable monitoring of services: for users and for admins
Closed, Migrated

Description

This meta-task tracks activities geared towards building reliable monitoring indicators of the Software Heritage services, for our users and for our own admins. Every service disruption should be tracked clearly, so that we avoid messages on IRC saying "oh, it's normal that XXX does not work, we did YYY some weeks ago".

This involves:

ensuring that status.softwareheritage.org always faithfully represents the operational status of the infrastructure
refining, if necessary, the list of services on status.softwareheritage.org
clearly announcing scheduled downtime and changes to the APIs, the WebApp or any other user-visible feature
adding missing monitoring points as needed

On the admin side, we also need to clearly identify the key indicators we want to follow, and reduce noise on them: these indicators should all be green during normal operation, and only show alerts that are meaningful and require intervention.

Revisions and Commits

rSPSITE puppet-swh-site
	D7844	rSPSITE8c8590ef6fad Allow icinga checks to write prometheus metrics files
	D5787	rSPSITE9c01d2124948 status.io: push save code now statistics
rDICP Icinga plugins
	D7807	rDICP11f9eae84707 Remove the wrong dependency added in the previous commit
	D6926	rDICP9812ac8f7b1d First iteration of prometheus export of the e2e metrics

Related Objects

Mentioned In:
rSPSITEa7c2fc6b65a1: icinga checks: Activate the prometheus export on e2e tests
rDICPb0a683a07b41: Declare the prometheus client dependency
D7764: icinga checks: Activate the prometheus export on e2e tests
T4133: Make lag monitoring dashboards easy to find
rSPSITE705f4d26a234: status.io: fix api credentials
rSPPRIVCc5692f8dbafc: Add censored status.io api credentials
rCJSWHbe7d718f636b: jobs/dependency-packages: change python3-statusio display name
rCJSWHd241c051ac0d: jobs/dependency-packages: Add statusio-python package
rDSNIP8a5d814541aa: status.io: configure the script via parameters
rDSNIPa212bc6fa643: POC status.io's metrics

Event Timeline

rdicosmo created this task. Mar 14 2021, 8:17 PM

vlorentz triaged this task as Normal priority. Mar 15 2021, 12:29 PM

vsellier claimed this task. Apr 23 2021, 3:13 PM

vsellier changed the task status from Open to Work in Progress. May 20 2021, 12:01 PM

From the status.swh.org point of view, status.io provides API endpoints to push metrics. It should be possible to add some metrics (up to 10 with our plan) to expose the behavior of the platform (daily, weekly and monthly statistics).
As a first step, we could expose the number of pending save code now requests and the number of origin visits, to have some live data. An example of a status page with metrics: https://status.docker.com/
I'm working on a code snippet to test the integration feasibility/complexity.
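
To give an idea of what such a push could look like, here is a minimal sketch that only builds the payload carrying the daily, weekly and monthly series. The function name `build_metric_payload` is hypothetical, and the field names (`day_avg`, `day_start`, `day_dates`, `day_values`, and so on) are assumptions modeled on the status.io metric update endpoint; they should be checked against the status.io API reference before use. Nothing is sent over the network here.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def build_metric_payload(statuspage_id, metric_id, day, week, month):
    """Build the body of a metric update request.

    Each of day/week/month is a list of (datetime, value) samples,
    oldest first. Field names are assumptions, not a verified schema.
    """

    def series(prefix, samples):
        values = [value for _, value in samples]
        return {
            f"{prefix}_avg": round(mean(values), 2) if values else 0,
            # Start of the series as epoch milliseconds (assumed unit).
            f"{prefix}_start": int(samples[0][0].timestamp() * 1000) if samples else 0,
            f"{prefix}_dates": [ts.isoformat() for ts, _ in samples],
            f"{prefix}_values": values,
        }

    payload = {"statuspage_id": statuspage_id, "metric_id": metric_id}
    for prefix, samples in (("day", day), ("week", week), ("month", month)):
        payload.update(series(prefix, samples))
    return payload


if __name__ == "__main__":
    start = datetime(2021, 5, 25, tzinfo=timezone.utc)
    # Hypothetical hourly counts of pending save code now requests.
    day = [(start + timedelta(hours=h), h) for h in range(24)]
    payload = build_metric_payload("page-id", "metric-id", day, [], [])
    print(payload["day_avg"], len(payload["day_values"]))
```

Keeping the payload construction separate from the HTTP call makes it easy to test the aggregation without credentials.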

For the user-facing services, the vault service is not exposed on the status page.
We could also add an integration with uptimerobot (free for fewer than 50 probes) to automatically open an incident if some endpoints are not responding, but it should be done carefully to avoid creating false incidents.
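
One possible way to limit false incidents is to require several consecutive failed probes before opening an incident, and several consecutive successes before closing it. The sketch below illustrates that idea only; the class name `IncidentGate` and the thresholds are hypothetical, and it does not use any uptimerobot or status.io API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentGate:
    """Debounce probe results before opening or closing an incident."""

    failures_to_open: int = 3    # consecutive failures needed to open
    successes_to_close: int = 2  # consecutive successes needed to close
    incident_open: bool = False
    _fail_streak: int = 0
    _ok_streak: int = 0

    def observe(self, probe_ok: bool) -> Optional[str]:
        """Record one probe result; return "open", "close" or None."""
        if probe_ok:
            self._ok_streak += 1
            self._fail_streak = 0
            if self.incident_open and self._ok_streak >= self.successes_to_close:
                self.incident_open = False
                return "close"
        else:
            self._fail_streak += 1
            self._ok_streak = 0
            if not self.incident_open and self._fail_streak >= self.failures_to_open:
                self.incident_open = True
                return "open"
        return None


if __name__ == "__main__":
    gate = IncidentGate()
    for result in [True, False, True, False, False, False, True, True]:
        print(result, gate.observe(result))
```

With these thresholds, a single failed probe between two successful ones never opens an incident.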

For the impact of public-facing changes, perhaps we should revive the dedicated page in the documentation: https://docs.softwareheritage.org/devel/archive-changelog.html which is currently quite outdated.

I have also been thinking for some time about a monitoring probe that checks the version of the swh packages installed on the servers, creates a new grafana annotation when a new version is detected, and raises an alert if some servers are outdated.
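
A rough sketch of the logic such a probe could implement, assuming the installed versions have already been collected per host. The function names, the host names and the annotation tags are hypothetical. The version parser only handles plain dotted numeric versions, which is a simplification compared to real Debian package versions. The annotation dictionary follows the shape of a grafana annotation (`time` in epoch milliseconds, `tags`, `text`) but is only built here, not sent.

```python
from typing import Dict, List, Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    # Assumption: plain dotted numeric versions such as "0.22.1".
    return tuple(int(part) for part in version.split("."))


def check_versions(
    installed: Dict[str, str], last_seen: Optional[str], now_ms: int
) -> Tuple[str, List[str], Optional[dict]]:
    """Return (latest version, outdated hosts, annotation or None)."""
    latest = max(installed.values(), key=parse_version)
    outdated = sorted(
        host
        for host, version in installed.items()
        if parse_version(version) < parse_version(latest)
    )
    annotation = None
    # Only annotate when the newest version was not seen on a previous run.
    if last_seen is None or parse_version(latest) > parse_version(last_seen):
        annotation = {
            "time": now_ms,
            "tags": ["deployment", "swh"],
            "text": f"New swh package version detected: {latest}",
        }
    return latest, outdated, annotation


if __name__ == "__main__":
    hosts = {"worker01": "0.10.0", "worker02": "0.9.0", "worker03": "0.10.0"}
    print(check_versions(hosts, "0.9.0", 1621933980000))
```

A non-empty list of outdated hosts would be the condition for raising the alert.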

vsellier mentioned this in rDSNIPa212bc6fa643: POC status.io's metrics. May 25 2021, 9:13 AM

Metrics can easily be pushed to the status page.
A simple PoC for the save code now requests is available here: https://forge.softwareheritage.org/source/snippets/browse/master/sysadmin/status.io/update_metrics.py

The result is displayed on the status page: [screenshots of the daily, weekly and monthly stats]

vsellier mentioned this in rDSNIP8a5d814541aa: status.io: configure the script via parameters. May 26 2021, 10:03 AM

vsellier mentioned this in rCJSWHd241c051ac0d: jobs/dependency-packages: Add statusio-python package. May 26 2021, 1:48 PM

vsellier mentioned this in rCJSWHbe7d718f636b: jobs/dependency-packages: change python3-statusio display name. May 26 2021, 1:50 PM

vsellier mentioned this in rSPPRIVCc5692f8dbafc: Add censored status.io api credentials. May 26 2021, 5:00 PM

vsellier added a revision: D5787: status.io: push save code now statistics. May 26 2021, 5:06 PM

vsellier added a commit: rSPSITE9c01d2124948: status.io: push save code now statistics. May 27 2021, 9:10 AM

vsellier mentioned this in rSPSITE705f4d26a234: status.io: fix api credentials. May 27 2021, 10:46 AM

The save code now queue statistics are now displayed on the status.io page [1] as an example. The data is refreshed every 5 minutes.

[1] https://status.softwareheritage.org/

great ;)

Current status:
Following the last discussions, the track I'm currently trying to implement is a grafana dashboard displaying the current status of the infrastructure.
To do so, some information managed by icinga, like the end-to-end checks status, should be displayed in grafana.

Several options are currently identified:

create an icinga datasource for grafana [3]
send icinga stats to prometheus [1]
export icinga stats to an influxdb database [2] and use them in grafana

[1] https://github.com/opsdis/monitor-promdiscovery and https://github.com/opsdis/icinga2-exporter
[2] https://grafana.com/grafana/dashboards/381
[3] https://grafana.com/tutorials/build-a-data-source-plugin/

Option [2] doesn't seem to be a viable solution, as it only covers performance statistics, which is probably not enough for our needs.
Options [1] and [3] look possible but introduce some complexity in the infra: the grafana datasource [3] carries a maintenance and compatibility risk when the icinga/grafana versions change, and the prometheus export [1] adds a new component to operate on the infra and stores duplicate information between icinga itself and the prometheus exporter's data.

Perhaps the first step is to mock up the dashboard to identify the probes we want to display and check how they can be integrated. For example, exporting the data to prometheus directly from the end-to-end checks could be possible.
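
As an illustration of exporting directly from a check, the sketch below renders the result of one end-to-end check in the Prometheus text exposition format and writes it atomically to a metrics file, which a textfile-style collector could then pick up. The metric names (`swh_e2e_check_success`, `swh_e2e_check_duration_seconds`), the `check` label and the file name are hypothetical, and only the standard library is used.

```python
import os
import tempfile


def render_metrics(check_name: str, success: bool, duration_seconds: float) -> str:
    """Render one check result in the Prometheus text exposition format."""
    labels = f'check="{check_name}"'
    lines = [
        "# HELP swh_e2e_check_success Last end-to-end check result (1=ok, 0=failed).",
        "# TYPE swh_e2e_check_success gauge",
        f"swh_e2e_check_success{{{labels}}} {1 if success else 0}",
        "# HELP swh_e2e_check_duration_seconds Duration of the last end-to-end check.",
        "# TYPE swh_e2e_check_duration_seconds gauge",
        f"swh_e2e_check_duration_seconds{{{labels}}} {duration_seconds}",
    ]
    return "\n".join(lines) + "\n"


def write_metrics_file(path: str, content: str) -> None:
    """Write to a temporary file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, path)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "e2e_vault.prom")
    write_metrics_file(target, render_metrics("vault", True, 12.5))
    print(open(target).read())
```

Writing to a temporary file in the same directory and renaming it keeps the update atomic on the same filesystem.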

ardumont updated the task description. Jan 10 2022, 10:09 AM

vsellier added a revision: D6926: First iteration of prometheus export of the e2e metrics. Jan 12 2022, 3:07 PM

bchauvet added a project: Roadmap 2022. Mar 23 2022, 5:10 PM

bchauvet raised the priority of this task from Normal to High. Mar 25 2022, 5:26 PM

bchauvet mentioned this in T4133: Make lag monitoring dashboards easy to find. Apr 11 2022, 11:47 AM

vsellier mentioned this in D7764: icinga checks: Activate the prometheus export on e2e tests. May 6 2022, 5:29 PM

vsellier added a commit: rDICP9812ac8f7b1d: First iteration of prometheus export of the e2e metrics. May 10 2022, 9:16 AM

vsellier mentioned this in rDICPb0a683a07b41: Declare the prometheus client dependency. May 10 2022, 2:21 PM

vsellier added a revision: D7807: Remove the wrong dependency added in the previous commit. May 10 2022, 6:08 PM

vsellier added a commit: rDICP11f9eae84707: Remove the wrong dependency added in the previous commit. May 10 2022, 6:11 PM

vsellier added a revision: D7844: Allow icinga checks to write prometheus metrics files. May 17 2022, 8:24 PM

vsellier added a commit: rSPSITE8c8590ef6fad: Allow icinga checks to write prometheus metrics files. May 19 2022, 2:09 PM