Replace Munin graphs with Grafana/Prometheus dashboards
Closed, Migrated, Edits Locked

Event Timeline

ftigeot triaged this task as High priority. Dec 13 2018, 4:19 PM
ftigeot created this task.
ftigeot changed the task status from Open to Work in Progress. Dec 13 2018, 4:21 PM
ftigeot moved this task from Backlog to in progress on the Sprint 2018 12 board.

Since the current state of munin's pg monitoring is inconsistent (e.g. azure's dbreplica0 has no pg curve), let's get rid of all of munin's pg monitors so there is less confusion. This is also one step towards T1356.

Work-in-progress Grafanalib dashboards have been added to the https://forge.softwareheritage.org/source/snippets/ repository.

> Work-in-progress Grafanalib dashboards have been added to the https://forge.softwareheritage.org/source/snippets/ repository.

I don't really understand why this is in the snippets repo instead of a dedicated one. Using the snippets repo, we cannot properly manage the ongoing work, do reviews, etc.

Regarding the content of this WIP, I don't understand why it covers the generation of basic dashboards (for pergamon only) that we already have thanks to the prometheus node exporter. We want dashboards (possibly generated using grafanalib, indeed) for metrics we currently do not have in prometheus/grafana, not for the ones we already have.

We can have a discussion to clarify the goals of this task, if needed.

When I asked where to put such work-in-progress, you suggested the snippets repository.

The goal of this task being to replace Munin graphs, I also aimed to reproduce the most important parts of existing Munin dashboards such as http://munin.internal.softwareheritage.org/softwareheritage.org/pergamon.softwareheritage.org/index.html#system with Prometheus / Grafana technologies.

> When I asked where to put such work-in-progress, you suggested the snippets repository.

Ok then I was wrong, I guess.

> The goal of this task being to replace Munin graphs, I also aimed to reproduce the most important parts of existing Munin dashboards such as http://munin.internal.softwareheritage.org/softwareheritage.org/pergamon.softwareheritage.org/index.html#system with Prometheus / Grafana technologies.

We did not discuss this in detail (not enough, it seems), but when we said we want to "kill munin" (which this task is a leaf of), it means two things: on the one hand, we want to be sure that all the useful metrics we currently have in munin are also in prometheus; on the other hand, we want dashboards that make these metrics available in a readable and comprehensible way.

This does not mean we want to duplicate munin's dashboards one-to-one. We want the data, and we want them usable (hence the dashboards), not necessarily an exact reproduction of munin's views.

Sorry if that was not clear enough.

The experience you gained with grafanalib is very valuable, but what we need from there is a comprehensive list of metrics we still have in munin that are not available in any form in prometheus. Then we can focus on how to present these in nice dashboards in grafana, for which grafanalib is a good solution.

Wasn't that what T1428 was about?
Apart from the list of pending packages, all commonly used Munin metrics should already have Prometheus equivalents.

Indeed, it is the object of T1428. That's why I am a bit puzzled that your work in progress does not simply target T1356. I was expecting some response to this very task in your grafanalib-based code, which I did not find. So I was wondering if I had missed something, and whether some data were still in munin only.

So: what information remains in munin that we do not have in Grafana (even if the presentation differs)? If there is none (which seems to be the case), then you should focus on closing T1356 and removing munin.

Even though most/all of the Munin metrics are provided by Prometheus, Munin also provides graphs.
It is these graphs we are still missing.

> It is these graphs we are still missing.

So: which ones are actually useful and missing? I mean, the basic CPU/mem/io/fs ones are available. We have several application-specific ones too (postgresql, kafka, etc.) as well as, recently, swh-specific ones (related to the scheduler, storage, objstorage and celery workers).
So what do we need that we do not have today?

And for the missing ones, given that we do have the data in prometheus, how critical/urgent is it to have them in fancy dashboards now (compared to, say, a proper preproduction/test platform)?

What prevents us from killing munin now?

If we remove Munin before implementing missing graph replacements, we will lack a comparison base and possibly fail to discover bogus data.
Right now, for example, Prometheus disk throughput and IOPS values are suspiciously low.
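
One way to cross-check a suspicious rate is to recompute it by hand from raw counter samples and compare it with what the other tool reports. The following is only a minimal sketch under stated assumptions: `counter_rate` and `relative_gap` are hypothetical helper names, the samples are (timestamp in seconds, counter value) pairs, and the rate is a plain increase-over-elapsed-time calculation that tolerates counter resets. It is a simplification and does not reproduce the extrapolation done by Prometheus's own `rate()` function.

```python
def counter_rate(samples):
    """Per-second rate from a list of (timestamp_seconds, counter_value) pairs.

    Hypothetical helper: a simplified increase/elapsed computation,
    not Prometheus's exact rate() algorithm.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A counter that goes backwards is assumed to have been reset to zero,
        # so the whole current value counts as increase.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


def relative_gap(a, b):
    """Relative difference between two readings of the same quantity (0.0 to 1.0)."""
    largest = max(abs(a), abs(b))
    return 0.0 if largest == 0 else abs(a - b) / largest


# Illustrative numbers only, not real measurements.
bytes_read = [(0, 0), (60, 6000)]
print(counter_rate(bytes_read))       # bytes per second over the window
print(relative_gap(100.0, 50.0))      # gap between two tools' readings
```

A large relative gap between the hand-computed value and a dashboard value would point at a unit or query problem rather than at the underlying data.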

Apart from that, we are missing a few things which I consider somewhat useful:

  • swap numbers in the memory usage graph
  • swap in/out transfer volumes
  • uptime
  • number of device I/O requests per second
  • storage device latency
  • storage device throughput
  • network interface traffic
  • possibly network interface errors

Prometheus does not provide storage device statistics for Proxmox container-based hosts.
The data can be read from their parent machine dashboards though.
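
The missing graphs listed above could plausibly be expressed as PromQL queries along the following lines. This is a sketch, not the queries used in the actual dashboards: it assumes the metric names exposed by node_exporter 0.16 or later, the function name `missing_graph_queries` is hypothetical, and the `instance` label value in the usage line is a placeholder.

```python
def missing_graph_queries(instance, window="5m"):
    """Map each missing Munin graph to a candidate PromQL expression.

    Assumes node_exporter >= 0.16 metric names; the function itself
    is a hypothetical helper for illustration.
    """
    sel = '{instance="%s"}' % instance   # instant-vector selector
    rng = "%s[%s]" % (sel, window)       # range-vector selector for rate()
    return {
        "swap used (bytes)":
            f"node_memory_SwapTotal_bytes{sel} - node_memory_SwapFree_bytes{sel}",
        "swap in (pages/s)": f"rate(node_vmstat_pswpin{rng})",
        "swap out (pages/s)": f"rate(node_vmstat_pswpout{rng})",
        "uptime (s)": f"time() - node_boot_time_seconds{sel}",
        "device read requests/s": f"rate(node_disk_reads_completed_total{rng})",
        "device write requests/s": f"rate(node_disk_writes_completed_total{rng})",
        # Average latency = time spent on requests / number of requests.
        "device read latency (s)":
            f"rate(node_disk_read_time_seconds_total{rng})"
            f" / rate(node_disk_reads_completed_total{rng})",
        "device write latency (s)":
            f"rate(node_disk_write_time_seconds_total{rng})"
            f" / rate(node_disk_writes_completed_total{rng})",
        "device read throughput (B/s)": f"rate(node_disk_read_bytes_total{rng})",
        "device write throughput (B/s)": f"rate(node_disk_written_bytes_total{rng})",
        "network receive (B/s)": f"rate(node_network_receive_bytes_total{rng})",
        "network transmit (B/s)": f"rate(node_network_transmit_bytes_total{rng})",
        "network receive errors/s": f"rate(node_network_receive_errs_total{rng})",
        "network transmit errors/s": f"rate(node_network_transmit_errs_total{rng})",
    }


# Placeholder instance label, for illustration only.
for name, expr in missing_graph_queries("example-host:9100").items():
    print(f"{name}: {expr}")
```

Each expression could then serve as the target of one graph panel, whether the dashboard is written by hand or generated with grafanalib.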

Grafanalib dashboards have been added to https://grafana.softwareheritage.org/ via the new provisioning mechanism of Grafana 5.x.
Fully automated provisioning is still a work-in-progress.

Any chance we can close this now?

ftigeot claimed this task.

I do not see any missing piece in the Grafana dashboard, so the Munin graph service/VM can be shut down.