Investigate prometheus as an alternative to munin
Closed, Migrated

Description

The munin ecosystem has been faltering for some time. It is getting harder and harder to find third-party plugins for the tools we're deploying; most notably, the munin tooling around ceph seems very poor. Writing plugins ourselves takes a lot of time that we don't really have.

The monitoring system that seems to be getting the most traction, and that currently has the most vibrant ecosystem, is prometheus. The package in Debian is well maintained, and high-quality backports are available.

In addition to "external" monitoring through e.g. the node exporter, prometheus seems to be very good at "white-box" monitoring, where you instrument the code so that it exposes intrinsic metrics itself. The python API to do so is very easy to use, and this would help us gain more insight into our systems.
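
As an illustration of what such white-box instrumentation could look like, here is a minimal sketch using the prometheus_client library (packaged as python3-prometheus-client). The metric names, the `endpoint` label and the `handle_request` function are hypothetical and only serve the example; they are not taken from our code base.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A dedicated registry keeps this sketch isolated from the global default one.
registry = CollectorRegistry()

# Hypothetical metrics: a per-endpoint request counter and a latency histogram.
REQUESTS = Counter(
    "swh_example_requests_total",
    "Number of requests handled",
    ["endpoint"],
    registry=registry,
)
DURATION = Histogram(
    "swh_example_request_duration_seconds",
    "Time spent handling a request",
    registry=registry,
)


@DURATION.time()  # records the duration of every call in the histogram
def handle_request(endpoint: str) -> str:
    """Stand-in for real service code; only the instrumentation matters here."""
    REQUESTS.labels(endpoint=endpoint).inc()
    return f"handled {endpoint}"


def render_metrics() -> str:
    """Return the metrics in the text format that prometheus scrapes."""
    return generate_latest(registry).decode("utf-8")
```

In a long-running service, one would typically use the default registry and call `prometheus_client.start_http_server()` on some port instead of rendering the text by hand, so that prometheus can scrape the process directly.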

The main challenges of using prometheus are:

  1. reintegration of our custom metrics. Our single important custom graph is the object count graph, which is backed by a trivial SQL query that can be added to the sql exporter.
  2. keeping historical data: backfilling is on the roadmap but not implemented (and apparently hard to implement with the current data storage backend).
  3. long-term storage: the built-in storage is not geared towards long-term storage of metrics, but rather towards keeping a short history of a lot (millions) of metrics. Metrics we want to keep long-term will probably need to be offloaded to some other system. We need to assess which metrics we want to keep long-term.
  4. graphing: prometheus doesn't generate index pages of graphs like munin does, so we have to build dashboards in an external tool such as grafana. The grafana ecosystem seems mature enough for our purposes.
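
To make the first challenge more concrete: turning the result of an SQL query into a metric boils down to running the query and printing the value in the prometheus text exposition format. The sketch below shows the idea with the standard library only. It is not how the sql exporter is implemented or configured; sqlite3 stands in for PostgreSQL, and the table name, metric name and label are hypothetical.

```python
import sqlite3

# Hypothetical allow-list of tables we are willing to count.
ALLOWED_TABLES = {"content", "directory", "revision"}


def object_count(conn: sqlite3.Connection, table: str) -> int:
    """Run the trivial counting query for one (allow-listed) table."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table}")
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return count


def render_gauge(name: str, help_text: str, samples) -> str:
    """Render (labels, value) pairs as a gauge in the text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"
```

With a database connection at hand, `render_gauge("swh_object_count", "Number of archived objects", [({"object_type": "content"}, object_count(conn, "content"))])` yields text that prometheus can scrape as-is.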

The following exporters are packaged:

  • node (for OS metrics)
  • postgresql (single postgresql instance performance)
  • sql (arbitrary SQL queries)
  • pgbouncer
  • apache
  • varnish

Ceph has a built-in prometheus metrics exporter.

The following libraries are packaged:

  • python3-prometheus-client (instrumentation of Python services)
  • python3-django-prometheus (instrumentation of Django views)

The following exporters are available upstream:

Event Timeline

olasd changed the task status from Open to Work in Progress. Mar 8 2018, 3:05 PM
olasd triaged this task as High priority.
olasd created this task.

A new graph showing the indexer progression has been added, using the sql exporter [1].

What's already pleasing about prometheus/grafana:

  • we can specify the period we want for each data extraction/computation (here 1 hour) [2]
  • separation of concerns between the two (prometheus -> compute/store/provide data, grafana -> read and graph data).

[1] https://grafana.softwareheritage.org/d/bPlebbSiz/softwareheritage-indexer?orgId=1&from=now-2d&to=now

[2] Whereas with munin, I had to move the computation out of the plugin (using cron) to be able to change the period (which is 5 minutes by default).
I did not find an easy way to change it otherwise, and my attempt to specify the period directly in the plugin failed.

A new graph showing the scheduler tasks (oneshot, recurring) has been added, using the sql exporter [1].

We have discussed the possibility of keeping the sql exporter only for statistical purposes.
In the meantime, it's also used for graphing custom queries (2 for prado here, and 1 for somerset, as detailed in the swh-site:/data/hostsname/{somerset,prado}...yml repository).

[1] https://grafana.softwareheritage.org/d/LmGkNMNik/sofwareheritage-scheduler?orgId=1

I think we've done more than investigating at this point :-D