Create an inventory of useful Munin metrics
Closed, Resolved · Public

Description

We will then add these metrics to Prometheus.

Event Timeline

ftigeot triaged this task as Normal priority. Dec 4 2018, 2:45 PM
ftigeot created this task.
ftigeot added a subscriber: ardumont.
ftigeot changed the task status from Open to Work in Progress. Dec 4 2018, 4:11 PM

Disk

  • I/Os per device
  • Disk usage in percent
  • Utilization per device: is this real? It could be useful to see if a storage subsystem is overloaded
  • Disk usage in absolute, human-readable values; percentages are meaningless if we resize filesystems

Networking

  • eth0 traffic

Database

  • Postgres replication lag
  • Postgres database size
  • Postgres oldest query + oldest transaction
  • Postgres scan types (sequential / indexed)
  • Postgres wal segments
  • Postgres nb. of transactions

System

  • CPU usage
  • load average
  • Memory usage
  • Pending packages
  • Swap in/out
  • Uptime

RabbitMQ

  • Consumers
  • Memory used by queue
  • Unacknowledged messages
  • Nb. of connections

Softwareheritage (prado)

  • Almost everything
  • Most importantly Software Heritage Objects
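
The remark above about percentages and resized filesystems can be illustrated with a minimal Python sketch. The function name and the sample sizes are made up for illustration; they are not taken from Munin or from any exporter.

```python
def usage_percent(used_bytes: int, size_bytes: int) -> float:
    """Return used space as a percentage of the filesystem size."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 100.0 * used_bytes / size_bytes


GIB = 1024 ** 3
used = 80 * GIB  # hypothetical amount of data on the filesystem

# Same amount of data, before and after growing the filesystem.
before = usage_percent(used, 100 * GIB)
after = usage_percent(used, 200 * GIB)
print(f"before resize: {before:.0f}% used, after resize: {after:.0f}% used")
```

The used byte count is identical in both cases, yet the percentage halves after the resize, which is why an absolute value is worth graphing alongside it.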
vlorentz raised the priority of this task from Normal to High. Dec 5 2018, 5:10 PM
olasd added a subscriber: olasd. Dec 7 2018, 3:29 PM
| Munin metric | Comment | Prometheus metric combination | Prometheus comment |
| --- | --- | --- | --- |
| Disk | | | |
| I/Os per device | | node_disk_reads_completed_total; node_disk_writes_completed_total | Add derivative to get IOPS |
| Disk usage in percent (space) | | (node_filesystem_size_bytes - node_filesystem_{avail,free}_bytes) / node_filesystem_size_bytes | avail = available to non-root, free = available to root (tune2fs -m / reserved-blocks-percentage) |
| Disk usage in percent (inodes) | | (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files | |
| Utilization per device | is this real ? it could be useful to see if a storage subsystem is overloaded | node_disk_io_time_seconds_total | total time spent in seconds doing IO on the specified device; AFAICT the derivative of this counter is what munin calls "utilization per device" |
| | | node_disk_io_time_weighted_seconds_total | counts the number of seconds spent doing IO multiplied by the number of concurrent IO requests; maybe more relevant ? Docs: https://www.kernel.org/doc/Documentation/iostats.txt |
| Disk usage in absolute human values. | percentages are meaningless if we resize filesystems | node_filesystem_size_bytes - node_filesystem_{avail,free}_bytes | avail = available to non-root, free = available to root |
| Networking | | | |
| eth0 traffic | | node_network_receive_bytes_total; node_network_transmit_bytes_total | derivative for bytes per second |
| | | node_network_receive_packets_total; node_network_transmit_packets_total | derivative for packets per second |
| | | node_network_receive_errs_total; node_network_transmit_errs_total | alert if non-zero |
| Database | | | |
| | | | implemented with prometheus-sql-exporter |
| Postgres replication lag | | sql_pg_stat_replication{col=~'(send_lag_bytes,flush_lag_bytes,replay_lag_bytes)'} | replace commas with pipes... |
| Postgres database size | | sql_pg_stat_database{col="dbsize"} | |
| Postgres oldest transaction | | sql_pg_stat_activity{col="max_tx_duration"} | |
| Postgres oldest query | | ? | |
| Postgres scan types (sequential / indexed) | | sql_pg_stat_user_tables; sql_pg_statio_user_tables | |
| Postgres wal segments | | sql_archive_ready; sql_pg_stat_archiver | use derivative of sql_pg_stat_archiver values to get archival rates |
| Postgres nb. of transactions | | sql_txid | derivative to get tps |
| System | | | |
| CPU usage | | node_cpu_seconds_total | use derivative for CPU usage |
| load average | | node_load{1,5,15} | |
| Memory usage | | node_memory_* | |
| Pending packages | | XXX | needs to be implemented with the textfile collector (see /usr/share/doc/prometheus-node-exporter/examples/text_collector_examples/apt.sh) |
| Swap in/out | | node_vmstat_pswpin; node_vmstat_pswpout | unit ?? probably absolute number of pages |
| Uptime | | time() - node_boot_time_seconds | |
| RabbitMQ | | | |
| | | | use https://github.com/kbudde/rabbitmq_exporter or https://github.com/deadtrickster/prometheus_rabbitmq_exporter |
| Consumers | | | |
| Memory used by queue | | | |
| Unacknowledged messages | | | |
| Nb. of connections | | | |
| Softwareheritage (prado) | | | |
| Almost everything | | | integrate to sql-exporter configuration |
| Most importantly Software Heritage Objects | | | |
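
Several entries above call for a "derivative" of a counter (IOPS, bytes per second, tps). In Prometheus this is typically done with the `rate()` query function over a time window. The Python sketch below only illustrates the idea on two samples and is a simplification: it is not the algorithm Prometheus uses. The `Sample` type, the function name, and the sample values are hypothetical.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    timestamp: float  # seconds since the epoch
    value: float      # counter value, expected to only increase


def per_second_rate(older: Sample, newer: Sample) -> float:
    """Average per-second increase of a counter between two samples."""
    elapsed = newer.timestamp - older.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    increase = newer.value - older.value
    if increase < 0:
        # The counter went backwards (e.g. after a reboot).
        # Assumption: it restarted from zero, so count only the new value.
        increase = newer.value
    return increase / elapsed


# Hypothetical readings of a read-completion counter, 15 seconds apart.
reads = [Sample(0.0, 1000.0), Sample(15.0, 1600.0)]
iops = per_second_rate(reads[0], reads[1])
print(iops)  # 40.0
```

The same function applies to any of the `_total` counters listed above; only the unit of the result changes (operations, bytes, packets, or CPU seconds per second).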
vlorentz moved this task from Backlog to in progress on the Sprint 2018 12 board. Dec 7 2018, 6:19 PM
olasd moved this task from in progress to done on the Sprint 2018 12 board. Dec 19 2018, 4:19 PM
ftigeot closed this task as Resolved. Mar 20 2019, 11:43 AM
ftigeot claimed this task.

Already marked as done on 2018-12-19.