The munin ecosystem has been faltering for some time. It is becoming harder and harder to find third-party plugins for the tools we deploy; most notably, the munin tooling for ceph seems very poor. Writing plugins ourselves takes a lot of time that we don't really have.
The monitoring system that seems to be getting the most traction, and that currently has the most vibrant ecosystem, is prometheus. The package in Debian is well maintained, and high-quality backports are available.
In addition to "external" monitoring through e.g. the node exporter, prometheus seems to be very good at "white-box" monitoring, where you instrument the code so that it exposes its own intrinsic metrics. The Python API to do so is very easy to use, and this would help us get more insight into our systems.
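To make the idea of white-box instrumentation concrete, here is a minimal, dependency-free sketch: the code counts its own events and renders them in the prometheus text exposition format. Everything named here (`CallStats`, `load_object`, the `demo_` metric prefix) is hypothetical. In a real service the packaged python3-prometheus-client library would presumably provide the metric types and the HTTP endpoint instead of this hand-rolled class.

```python
import functools
import time


class CallStats:
    """Hand-rolled stand-in for a metrics library (illustrative only)."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.calls = 0
        self.failures = 0
        self.seconds = 0.0

    def instrument(self, func):
        """Decorator: count calls, failures and time spent in func."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            self.calls += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                self.failures += 1
                raise
            finally:
                # Runs on success and on failure alike.
                self.seconds += time.perf_counter() - start
        return wrapper

    def render(self):
        """Render the counters in the prometheus text exposition format."""
        lines = []
        for suffix, value in (
            ("calls_total", self.calls),
            ("failures_total", self.failures),
            ("seconds_total", self.seconds),
        ):
            name = f"{self.prefix}_{suffix}"
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"


stats = CallStats("demo_load_object")  # hypothetical metric prefix


@stats.instrument
def load_object(object_id):
    """Hypothetical business function being instrumented."""
    if object_id < 0:
        raise ValueError("negative object id")
    return {"id": object_id}


load_object(1)
load_object(2)
try:
    load_object(-1)
except ValueError:
    pass

print(stats.render())
```

The point of the decorator is that the metrics live next to the code they describe: a scrape of the rendered text then shows how often the function ran, how often it failed, and how much time it took in total.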
The main challenges of using prometheus are:
- reintegration of our custom metrics: our only important custom graph is the object count graph, which is backed by a trivial SQL query that can be added to the sql exporter.
- keeping historical data: backfilling is on the roadmap but not implemented yet (and apparently hard to implement with the current data storage backend).
- long-term storage: the built-in storage is not geared towards long-term storage of metrics, but rather towards keeping a short history of a lot (millions) of metrics. Metrics we want to keep long-term will probably need to be offloaded to some other system. We need to assess which metrics those are.
- graphing: prometheus doesn't provide graph index pages like munin does, so we have to build dashboards in an external tool such as grafana. The grafana ecosystem seems mature enough for our purposes.
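As an illustration of the first challenge, here is a hedged sketch of what the object count query looks like once it is turned into gauge samples. It uses an in-memory SQLite database as a stand-in for the production database so that it runs anywhere. The table name `objects`, the column `type` and the metric name `demo_object_count` are all assumptions, not our actual schema. In production the packaged sql exporter would run the query; this only shows the shape of the result.

```python
import sqlite3

# Assumption: a toy schema standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, type TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO objects (type) VALUES (?)",
    [("content",), ("content",), ("directory",), ("revision",)],
)

# Hypothetical version of the "trivial SQL query" behind the graph.
QUERY = "SELECT type, COUNT(*) FROM objects GROUP BY type ORDER BY type"


def object_count_samples(connection, metric="demo_object_count"):
    """Run the count query and return one gauge sample line per object type.

    Label values are assumed to be plain identifiers that need no escaping.
    """
    lines = [
        f"# HELP {metric} Number of objects, by type.",
        f"# TYPE {metric} gauge",
    ]
    for object_type, count in connection.execute(QUERY):
        lines.append(f'{metric}{{type="{object_type}"}} {count}')
    return lines


samples = object_count_samples(conn)
print("\n".join(samples))
```

A gauge is the natural type here because the value is a snapshot of the current row count, recomputed on each scrape, rather than a counter the application increments itself.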
The following exporters are packaged:
- node (for OS metrics)
- postgresql (single postgresql instance performance)
- sql (arbitrary SQL queries)
Ceph has a built-in prometheus metrics exporter.
The following libraries are packaged:
- python3-prometheus-client (instrumentation of Python services)
- python3-django-prometheus (instrumentation of Django views)
The following exporters are available upstream:
- rabbitmq (https://github.com/kbudde/rabbitmq_exporter)
- and more... https://prometheus.io/docs/instrumenting/exporters/