We will then add these metrics to Prometheus.
Description
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T1408 More/better Metrics
Migrated | gitlab-migration | T1356 Kill munin
Migrated | gitlab-migration | T1428 Create an inventory of useful Munin metrics
Event Timeline
Disk
- I/Os per device
- Disk usage in percent
- Utilization per device: is this real? It could be useful to see if a storage subsystem is overloaded.
- Disk usage in absolute, human-readable values: percentages are meaningless if we resize filesystems.
Networking
- eth0 traffic
Database
- Postgres replication lag
- Postgres database size
- Postgres oldest query + oldest transaction
- Postgres scan types (sequential / indexed)
- Postgres WAL segments
- Postgres number of transactions
System
- CPU usage
- Load average
- Memory usage
- Pending packages
- Swap in/out
- Uptime
RabbitMQ
- Consumers
- Memory used by queue
- Unacknowledged messages
- Number of connections
Softwareheritage (prado)
- Almost everything
- Most importantly Software Heritage Objects
| Munin metric | Comment | Prometheus metric combination | Prometheus comment |
|---|---|---|---|
| Disk | | | |
| I/Os per device | | node_disk_reads_completed_total; node_disk_writes_completed_total | Add derivative to get IOPS |
| Disk usage in percent (space) | | (node_filesystem_size_bytes - node_filesystem_{avail,free}_bytes) / node_filesystem_size_bytes | avail = available to non-root, free = available to root (tune2fs -m / reserved-blocks-percentage) |
| Disk usage in percent (inodes) | | (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files | |
| Utilization per device | is this real? It could be useful to see if a storage subsystem is overloaded | node_disk_io_time_seconds_total | total time spent in seconds doing IO on the specified device; AFAICT the derivative of this counter is what munin calls "utilization per device" |
| | | node_disk_io_time_weighted_seconds_total | counts the number of seconds spent doing IO multiplied by the number of concurrent IO requests; maybe more relevant? Docs: https://www.kernel.org/doc/Documentation/iostats.txt |
| Disk usage in absolute human values | percentages are meaningless if we resize filesystems | node_filesystem_size_bytes - node_filesystem_{avail,free}_bytes | avail = available to non-root, free = available to root |
| Networking | | | |
| eth0 traffic | | node_network_receive_bytes_total; node_network_transmit_bytes_total | derivative for bytes per second |
| | | node_network_receive_packets_total; node_network_transmit_packets_total | derivative for packets per second |
| | | node_network_receive_errs_total; node_network_transmit_errs_total | alert if non-zero |
| Database | | | implemented with prometheus-sql-exporter |
| Postgres replication lag | | sql_pg_stat_replication{col=~'(send_lag_bytes,flush_lag_bytes,replay_lag_bytes)'} | replace the commas with pipes in the actual query |
| Postgres database size | | sql_pg_stat_database{col="dbsize"} | |
| Postgres oldest transaction | | sql_pg_stat_activity{col="max_tx_duration"} | |
| Postgres oldest query | | ? | |
| Postgres scan types (sequential / indexed) | | sql_pg_stat_user_tables; sql_pg_statio_user_tables | |
| Postgres WAL segments | | sql_archive_ready; sql_pg_stat_archiver | use derivative of sql_pg_stat_archiver values to get archival rates |
| Postgres number of transactions | | sql_txid | derivative to get tps |
| System | | | |
| CPU usage | | node_cpu_seconds_total | use derivative for CPU usage |
| Load average | | node_load{1,5,15} | |
| Memory usage | | node_memory_* | |
| Pending packages | | XXX | needs to be implemented with the textfile collector (see /usr/share/doc/prometheus-node-exporter/examples/text_collector_examples/apt.sh) |
| Swap in/out | | node_vmstat_pswpin; node_vmstat_pswpout | unit unclear; probably an absolute number of pages |
| Uptime | | time() - node_boot_time_seconds | |
| RabbitMQ | | | use https://github.com/kbudde/rabbitmq_exporter or https://github.com/deadtrickster/prometheus_rabbitmq_exporter |
| Consumers | | | |
| Memory used by queue | | | |
| Unacknowledged messages | | | |
| Number of connections | | | |
| Softwareheritage (prado) | | | |
| Almost everything | | | integrate into the sql-exporter configuration |
| Most importantly Software Heritage Objects | | | |
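Many rows above say "derivative" of a `_total` counter, and the disk rows use a size-minus-remaining formula. In PromQL the derivative of a counter is normally taken with `rate()`, which also copes with counter resets. The Python sketch below only illustrates the arithmetic behind both ideas. The function names and the sample values are hypothetical, and the simplified rate does not reproduce the exact semantics of Prometheus's `rate()`.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase of a monotonically increasing counter.

    A drop in value is treated as a counter reset (e.g. after a reboot):
    the counter is assumed to have restarted from zero.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # Normal case: add the delta. Reset case: count from zero again.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


def fs_usage_percent(size_bytes: float, remaining_bytes: float) -> float:
    """Used space in percent.

    Pass node_filesystem_avail_bytes (non-root view) or
    node_filesystem_free_bytes (root view) as `remaining_bytes`.
    """
    if size_bytes <= 0:
        raise ValueError("size must be positive")
    return 100.0 * (size_bytes - remaining_bytes) / size_bytes


if __name__ == "__main__":
    # Hypothetical samples of node_disk_reads_completed_total, 15 s apart.
    reads = [(0.0, 1000.0), (15.0, 1300.0), (30.0, 1600.0)]
    print(counter_rate(reads))            # 20.0 reads per second
    print(fs_usage_percent(100e9, 25e9))  # 75.0
```

The same `counter_rate` arithmetic applies to the network, CPU, and transaction counters listed in the table; only the unit of the result changes.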