Investigate prometheus as an alternative to munin
Closed, Migrated

Authored By

	olasd
	Mar 8 2018, 3:05 PM

Description

The munin ecosystem has been faltering for some time. It is getting harder and harder to find third-party plugins for the tools we're deploying; most notably, the munin tooling around ceph seems very poor. Writing plugins ourselves takes a lot of time that we don't really have.

The monitoring system that currently seems to be getting the most traction, and that has the most vibrant ecosystem, is prometheus. The package in Debian is well maintained, and high-quality backports are available.

In addition to "external" monitoring through e.g. the node exporter, prometheus seems to be very good at "white-box" monitoring, where you instrument the code so that it exposes intrinsic metrics itself. The Python API to do so is very easy to use, and this would help us gain more insight into our systems.
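As an illustration of the white-box approach, here is a minimal sketch using the Python client library (packaged as python3-prometheus-client). The metric names, the label and the `load_batch` function are hypothetical and only stand in for real application code.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A dedicated registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

# Hypothetical metric names, chosen for illustration only.
OBJECTS_LOADED = Counter(
    "swh_objects_loaded",
    "Number of objects loaded, by object type",
    ["object_type"],
    registry=registry,
)
LOAD_DURATION = Histogram(
    "swh_load_duration_seconds",
    "Time spent loading one batch of objects",
    registry=registry,
)


@LOAD_DURATION.time()  # observes the duration of every call
def load_batch(object_type, objects):
    """Hypothetical stand-in for application code; only the counting matters here."""
    OBJECTS_LOADED.labels(object_type=object_type).inc(len(objects))
    return len(objects)


load_batch("content", ["a", "b", "c"])
load_batch("revision", ["d"])

# Text exposition format, as a scrape of the service would return it.
print(generate_latest(registry).decode())
```

In a long-running service, the library's `start_http_server` helper could expose these metrics over HTTP so that prometheus can scrape them; the sketch only prints them.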

The main challenges of using prometheus are:

Reintegration of our custom metrics: our single important custom graph is the object count graph, which is a trivial SQL query that can be added to the sql exporter.
Keeping historical data: backfilling is on the roadmap but not implemented (and apparently hard to implement with the current data storage backend).
Long-term storage: the built-in storage is not geared towards long-term storage of metrics, but rather towards keeping a short history of a lot (millions) of metrics. Metrics we want to keep long-term will probably need to be offloaded to some other system. We need to assess which metrics we want to keep long-term.
Graphing: prometheus doesn't provide graph indexes like munin does; we have to build dashboards in an external tool such as grafana. The grafana ecosystem seems mature enough for our purposes.
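
To give an idea of what the object count metric boils down to, here is a rough sketch that runs a count query and renders the result as a gauge in the prometheus text exposition format. It is not the sql exporter's configuration or implementation: sqlite3 stands in for PostgreSQL so the example is self-contained, and the table and metric names are hypothetical.

```python
import sqlite3


def render_gauge(name, help_text, value):
    """Render one unlabeled gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {float(value)}\n"
    )


def object_count_metric(conn):
    # Hypothetical table name; the real query would target the archive database.
    (count,) = conn.execute("SELECT COUNT(*) FROM content").fetchone()
    return render_gauge("swh_content_objects", "Number of content objects", count)


# In-memory database standing in for the real one (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO content (id) VALUES (?)", [(i,) for i in range(5)])

print(object_count_metric(conn), end="")
```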

The following exporters are packaged:

node (for OS metrics)
postgresql (single postgresql instance performance)
sql (arbitrary SQL queries)
pgbouncer
apache
varnish

Ceph has a built-in prometheus metrics exporter.

The following libraries are packaged:

python3-prometheus-client (instrumentation of Python services)
python3-django-prometheus (instrumentation of Django views)

The following exporters are available upstream:

rabbitmq (https://github.com/kbudde/rabbitmq_exporter)
and more... https://prometheus.io/docs/instrumenting/exporters/

Related Objects

		Status	Assigned	Task
		Migrated	gitlab-migration	T988 Investigate prometheus as an alternative to munin
		Migrated	gitlab-migration	T362 plugin to monitor the amount of visits per origin

Event Timeline

olasd changed the task status from Open to Work in Progress. (Mar 8 2018, 3:05 PM)

olasd triaged this task as High priority.

olasd created this task.

olasd mentioned this in rSPSITE32f22aa71247: Add prometheus data. (Mar 9 2018, 4:18 PM)

olasd mentioned this in rSPPROFab51a38d5830: Add prometheus configuration.

ardumont mentioned this in rSPPROF11d675f85957: prometheus: Permit extra config setup for sql exporter. (Jun 8 2018, 4:53 PM)

ardumont mentioned this in rSPSITEab477a3a1a65: prometheus: Permit extra config setup for sql exporter. (Jun 8 2018, 4:56 PM)

ardumont mentioned this in rSPSITE377b952e640a: data/defaults: Compute extra indexer time data point.

ardumont mentioned this in rSPSITEd909f8e7226c: data/defaults: Fix setup indentation.

A new graph has been added to track the indexer progression using the sql exporter [1].

What's pleasing with prometheus/grafana already is:

we can specify the period we want per data extraction/computation (here 1 hour) [2]
separation of concerns between the two (prometheus -> compute/store/provide data, grafana -> read and graph data).

[1] https://grafana.softwareheritage.org/d/bPlebbSiz/softwareheritage-indexer?orgId=1&from=now-2d&to=now

[2] Whereas with munin, I had to extract the computation into a cron job to be able to change the period (which is 5 min by default).
I did not find an easy way to change it otherwise, and my attempt to specify the period in the plugin directly failed.
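
The per-query period mentioned above (here 1 hour) can be pictured as a cache that re-runs an expensive computation at most once per interval and serves the last result in between. The sketch below is an assumption about the general idea, not the sql exporter's actual implementation; the `CachedQuery` class and its parameters are hypothetical.

```python
import time


class CachedQuery:
    """Re-run `compute` at most once per `interval` seconds; serve the cached value otherwise."""

    def __init__(self, compute, interval, clock=time.monotonic):
        self._compute = compute
        self._interval = interval
        self._clock = clock  # injectable so the behaviour can be checked without waiting
        self._value = None
        self._expires_at = None  # None means "never computed yet"

    def value(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._compute()
            self._expires_at = now + self._interval
        return self._value


# Usage with a fake clock: the computation runs on the first call and
# again only once a full hour (3600 s) has elapsed.
fake_now = [0.0]
runs = []


def expensive_query():
    runs.append(fake_now[0])
    return len(runs)


indexer_progress = CachedQuery(expensive_query, interval=3600, clock=lambda: fake_now[0])
indexer_progress.value()  # computed at t=0
fake_now[0] = 300.0
indexer_progress.value()  # served from cache
fake_now[0] = 3600.0
indexer_progress.value()  # recomputed
print(runs)
```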

ardumont mentioned this in T362: plugin to monitor the amount of visits per origin. (Jun 13 2018, 3:10 PM)

ardumont added a subtask: T362: plugin to monitor the amount of visits per origin.

ardumont mentioned this in rSPSITE21b89d98ba4a: hostname/prado: Add prometheus queries for scheduler db. (Jun 15 2018, 12:45 PM)

A new graph has been added to track the scheduler tasks (oneshot, recurring) using the sql exporter [1].

We discussed the possibility of keeping the sql exporter only for statistical purposes.
In the meantime, it is also used for graphing custom queries (2 for prado here, and 1 for somerset, as detailed in the swh-site repository: swh-site:/data/hostsname/{somerset,prado}...yml).

[1] https://grafana.softwareheritage.org/d/LmGkNMNik/sofwareheritage-scheduler?orgId=1
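
For a per-policy graph like this one, a grouped query maps naturally onto one labeled sample per group. The following sketch shows that mapping under stated assumptions: sqlite3 stands in for the scheduler database, and the `task` table, its `policy` column and the metric name are hypothetical rather than taken from the actual exporter configuration.

```python
import sqlite3


def escape_label(value):
    """Escape a label value as required by the text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_labeled_gauge(name, help_text, label, rows):
    """Render one gauge sample per (label_value, value) row."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for label_value, value in rows:
        lines.append(f'{name}{{{label}="{escape_label(label_value)}"}} {float(value)}')
    return "\n".join(lines) + "\n"


def scheduler_task_counts(conn):
    # Hypothetical schema: one row per task, with a `policy` column.
    query = "SELECT policy, COUNT(*) FROM task GROUP BY policy ORDER BY policy"
    return conn.execute(query).fetchall()


# In-memory database standing in for the scheduler db (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, policy TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO task (policy) VALUES (?)",
    [("oneshot",)] * 2 + [("recurring",)] * 3,
)

text = render_labeled_gauge(
    "swh_scheduler_tasks",
    "Number of scheduler tasks by policy",
    "policy",
    scheduler_task_counts(conn),
)
print(text, end="")
```

Grafana can then plot one series per `policy` label value without any change on the exporter side.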

olasd mentioned this in rSPSITEab51a38d5830: Add prometheus configuration. (Jun 15 2018, 2:30 PM)

ardumont mentioned this in rSPSITE11d675f85957: prometheus: Permit extra config setup for sql exporter. (Jun 15 2018, 2:30 PM)

I think we've done more than investigating at this point :-D

This task has been migrated to GitLab.
