
Create an inventory of useful Munin metrics
Started, Work in Progress, High, Public

Description

We will then add these metrics to Prometheus.

Related Objects

Status | Assigned | Task
Open | None |
Open | None |
Work in Progress | None |

Event Timeline

ftigeot created this task. Tue, Dec 4, 2:45 PM
ftigeot triaged this task as Normal priority.
ftigeot added a subscriber: ardumont.
ftigeot changed the task status from Open to Work in Progress. Tue, Dec 4, 4:11 PM

Disk

  • I/Os per device
  • Disk usage in percent
  • Utilization per device: is this real? It could be useful to see if a storage subsystem is overloaded
  • Disk usage in absolute, human-readable values: percentages are meaningless if we resize filesystems

Networking

  • eth0 traffic

Database

  • Postgres replication lag
  • Postgres database size
  • Postgres oldest query + oldest transaction
  • Postgres scan types (sequential / indexed)
  • Postgres wal segments
  • Postgres nb. of transactions

System

  • CPU usage
  • load average
  • Memory usage
  • Pending packages
  • Swap in/out
  • Uptime

RabbitMQ

  • Consumers
  • Memory used by queue
  • Unacknowledged messages
  • Nb. of connections

Softwareheritage (prado)

  • Almost everything
  • Most importantly Software Heritage Objects
vlorentz raised the priority of this task from Normal to High. Wed, Dec 5, 5:10 PM
olasd added a subscriber: olasd. Fri, Dec 7, 3:29 PM
Munin metric | Comment | Prometheus metric combination | Prometheus comment
Disk
I/Os per device | | node_disk_reads_completed_total; node_disk_writes_completed_total | Add derivative to get IOPS
Disk usage in percent (space) | | (node_filesystem_size_bytes - node_filesystem_{avail,free}_bytes) / node_filesystem_size_bytes | avail = available to non-root, free = available to root (tune2fs -m / reserved-blocks-percentage)
Disk usage in percent (inodes) | | (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files |
Utilization per device | is this real? it could be useful to see if a storage subsystem is overloaded | node_disk_io_time_seconds_total | total time spent in seconds doing IO on the specified device; AFAICT the derivative of this counter is what munin calls "utilization per device"
 | | node_disk_io_time_weighted_seconds_total | counts the number of seconds spent doing IO multiplied by the number of concurrent IO requests; maybe more relevant? Docs: https://www.kernel.org/doc/Documentation/iostats.txt
Disk usage in absolute human values. | percentages are meaningless if we resize filesystems | node_filesystem_size_bytes - node_filesystem_{avail,free}_bytes | avail = available to non-root, free = available to root
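
The disk usage rows above reduce to simple arithmetic on the size gauge and one of the avail/free gauges. A minimal Python sketch of that arithmetic, assuming made-up sample values for a single filesystem; the function names `used_bytes`, `used_fraction` and `human` are illustrative and do not come from any exporter. Note that the ratio in the table is a fraction between 0 and 1, so it needs a factor of 100 to read as a percentage.

```python
def used_bytes(size: float, remaining: float) -> float:
    """Bytes in use: size minus either the avail or the free gauge."""
    return size - remaining


def used_fraction(size: float, remaining: float) -> float:
    """Usage as a fraction of the filesystem size (multiply by 100 for percent)."""
    return used_bytes(size, remaining) / size


def human(num_bytes: float) -> str:
    """Format a byte count with binary units, e.g. '65.0 GiB'."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit we know about.
        if abs(value) < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


GIB = 1024 ** 3
size = 100 * GIB   # stands in for node_filesystem_size_bytes
avail = 35 * GIB   # stands in for node_filesystem_avail_bytes (non-root view)
free = 40 * GIB    # stands in for node_filesystem_free_bytes (root view)

print(f"{used_fraction(size, avail) * 100:.0f}% used as seen by non-root")
print(f"{used_fraction(size, free) * 100:.0f}% used as seen by root")
print(human(used_bytes(size, avail)), "used in absolute terms")
```

The gap between the two percentages is the space reserved for root, which is why the table lists both variants.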
Networking
eth0 traffic | | node_network_receive_bytes_total; node_network_transmit_bytes_total | derivative for bytes per second
 | | node_network_receive_packets_total; node_network_transmit_packets_total | derivative for packets per second
 | | node_network_receive_errs_total; node_network_transmit_errs_total | alert if non-zero
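
Several rows ask for the "derivative" of a counter. In PromQL this is normally done with `rate()`; the underlying idea can be sketched in Python for two consecutive samples. This is only an illustration under assumptions: the `Sample` type and `per_second` function are hypothetical names, the sample values are invented, and the reset handling (treating a decrease as a restart from zero) is a simplification of what Prometheus does over a whole range of samples.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    timestamp: float  # seconds since some reference point
    value: float      # monotonically increasing counter value


def per_second(prev: Sample, curr: Sample) -> float:
    """Average per-second increase of a counter between two samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    increase = curr.value - prev.value
    if increase < 0:
        # The counter went backwards, e.g. after a reboot: assume it
        # restarted from zero, so the increase is the current value.
        increase = curr.value
    return increase / elapsed


# Invented values standing in for node_network_receive_bytes_total.
before = Sample(timestamp=0.0, value=1_000.0)
after = Sample(timestamp=15.0, value=16_000.0)
print(per_second(before, after), "bytes per second")
```

The same calculation applies to the packet counters, the disk I/O counters and node_cpu_seconds_total; only the unit of the result changes.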
Database
implemented with prometheus-sql-exporter
Postgres replication lag | | sql_pg_stat_replication{col=~'(send_lag_bytes,flush_lag_bytes,replay_lag_bytes)'} | replace commas with pipes...
Postgres database size | | sql_pg_stat_database{col="dbsize"} |
Postgres oldest transaction | | sql_pg_stat_activity{col="max_tx_duration"} |
Postgres oldest query | | ? |
Postgres scan types (sequential / indexed) | | sql_pg_stat_user_tables; sql_pg_statio_user_tables |
Postgres wal segments | | sql_archive_ready; sql_pg_stat_archiver | use derivative of sql_pg_stat_archiver values to get archival rates
Postgres nb. of transactions | | sql_txid | derivative to get tps
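
The "replace commas with pipes" note on the replication lag row refers to regex alternation: a `=~` matcher is a regular expression anchored to the whole label value, so the three column names must be separated by `|`. A small Python sketch of building such a selector; the `selector` and `matches` helpers are hypothetical, and Python's `re` module only stands in for Prometheus's RE2 engine, which behaves the same way for a plain alternation like this one.

```python
import re

LAG_COLUMNS = ["send_lag_bytes", "flush_lag_bytes", "replay_lag_bytes"]


def selector(metric: str, label: str, values: list[str]) -> str:
    """Build a selector whose label matcher accepts any of the given values."""
    pattern = "|".join(re.escape(value) for value in values)
    return f"{metric}{{{label}=~'({pattern})'}}"


def matches(pattern: str, value: str) -> bool:
    """Mimic an anchored regex matcher: the whole value must match."""
    return re.fullmatch(pattern, value) is not None


print(selector("sql_pg_stat_replication", "col", LAG_COLUMNS))

# With commas the pattern is one long literal and matches no single column.
print(matches("(send_lag_bytes,flush_lag_bytes,replay_lag_bytes)", "flush_lag_bytes"))
# With pipes each column name is its own alternative.
print(matches("(send_lag_bytes|flush_lag_bytes|replay_lag_bytes)", "flush_lag_bytes"))
```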
System
CPU usage | | node_cpu_seconds_total | use derivative for CPU usage
load average | | node_load{1,5,15} |
Memory usage | | node_memory_* |
Pending packages | | XXX | needs to be implemented with the textfile collector (see /usr/share/doc/prometheus-node-exporter/examples/text_collector_examples/apt.sh)
Swap in/out | | node_vmstat_pswpin; node_vmstat_pswpout | unit?? probably absolute number of pages
Uptime | | time() - node_boot_time_seconds |
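
For the pending packages row, the textfile collector reads `*.prom` files in the Prometheus text exposition format from a configured directory. A rough Python sketch of the writing side, under stated assumptions: the metric name `pending_packages`, the file name `apt.prom` and both helper functions are hypothetical (the apt.sh example referenced above may use different names), and obtaining the actual package count is left out.

```python
import os
import tempfile


def render_metric(name: str, help_text: str, value: int) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


def write_atomically(directory: str, filename: str, content: str) -> None:
    """Write to a temporary file, then rename it into place.

    The rename keeps the exporter from ever reading a half-written file;
    the temporary file does not end in .prom, so it is not picked up.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # mkstemp creates the file readable by its owner only; relax that so
    # an exporter running as another user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, os.path.join(directory, filename))


# Hypothetical metric name and an invented count of 3 pending packages.
print(render_metric("pending_packages", "Number of packages with pending upgrades", 3))
```

A cron job or systemd timer could call `write_atomically` with the collector's textfile directory once the count is known.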
RabbitMQ
use https://github.com/kbudde/rabbitmq_exporter or https://github.com/deadtrickster/prometheus_rabbitmq_exporter
Consumers
Memory used by queue
Unacknowledged messages
Nb. of connections
Softwareheritage (prado)
Almost everything | | | integrate to sql-exporter configuration
Most importantly Software Heritage Objects
vlorentz moved this task from Backlog to in progress on the Sprint 2018 12 board. Fri, Dec 7, 6:19 PM