A search query to show candidates for this label: https://forge.softwareheritage.org/maniphest/query/Z8t6ZkjPI7JV/ (i.e. it filters out high-priority tasks, tasks that already have an assignee, and sysadmin tasks)
Dec 13 2018
Dec 12 2018
Feature request from epol on IRC: being able to selectively listen only for the graph and not the object changes. (I guess this will be the case anyway, I'm just mentioning that feedback on the task.)
Dec 11 2018
Added draft dashboard: https://grafana.softwareheritage.org/d/3SAW_JEmk/software-heritage-archive-counters
Since software is mostly declared in Puppet, I think the main areas that could be improved are:
- hardware inventory
- network topology
- puppet reports integration
Dec 10 2018
ardumont raised the priority of this task from Normal to High.
That won't capture all events, but the easiest solution is to use Jenkins' RSS: https://jenkins.softwareheritage.org/view/swh/rssAll
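A minimal sketch of how that feed could be consumed, assuming it is served as an Atom document (worth checking against the actual response). The helper names `fetch_feed` and `parse_build_entries` are hypothetical and only for illustration.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the Jenkins feed uses the standard Atom 1.0 namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def fetch_feed(url):
    """Download the feed and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_build_entries(feed_xml):
    """Return one dict (title, updated, link) per <entry> of an Atom feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
            "link": link.get("href") if link is not None else None,
        })
    return entries
```

Calling `parse_build_entries(fetch_feed("https://jenkins.softwareheritage.org/view/swh/rssAll"))` would then list the recent builds. As noted above, polling a feed like this only sees what is still in the feed at poll time, so it will not capture all events.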
Dec 7 2018
| Munin metric | Comment | Prometheus metric combination | Prometheus comment |
| Disk | | | |
| I/Os per device | | node_disk_reads_completed_total; node_disk_writes_completed_total | Add derivative to get IOPS |
| Disk usage in percent (space) | | (node_filesystem_size_bytes - node_filesystem_{avail,free}_bytes) / node_filesystem_size_bytes | avail = available to non-root, free = available to root (tune2fs -m / reserved-blocks-percentage) |
| Disk usage in percent (inodes) | | (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files | |
| Utilization per device | is this real ? it could be useful to see if a storage subsystem is overloaded | node_disk_io_time_seconds_total | total time spent in seconds doing IO on the specified device; AFAICT the derivative of this counter is what munin calls "utilization per device" |
| | | node_disk_io_time_weighted_seconds_total | counts the number of seconds spent doing IO multiplied by the number of concurrent IO requests; maybe more relevant ? Docs: https://www.kernel.org/doc/Documentation/iostats.txt |
| Disk usage in absolute human values. | percentages are meaningless if we resize filesystems | node_filesystem_size_bytes - node_filesystem_{avail,free}_bytes | avail = available to non-root, free = available to root |
| Networking | | | |
| eth0 traffic | | node_network_receive_bytes_total; node_network_transmit_bytes_total | derivative for bytes per second |
| | | node_network_receive_packets_total; node_network_transmit_packets_total | derivative for packets per second |
| | | node_network_receive_errs_total; node_network_transmit_errs_total | alert if non-zero |
| Database | | | |
| implemented with prometheus-sql-exporter | | | |
| Postgres replication lag | | sql_pg_stat_replication{col=~'(send_lag_bytes,flush_lag_bytes,replay_lag_bytes)'} | replace commas with pipes... |
| Postgres database size | | sql_pg_stat_database{col="dbsize"} | |
| Postgres oldest transaction | | sql_pg_stat_activity{col="max_tx_duration"} | |
| Postgres oldest query | | ? | |
| Postgres scan types (sequential / indexed) | | sql_pg_stat_user_tables; sql_pg_statio_user_tables | |
| Postgres wal segments | | sql_archive_ready; sql_pg_stat_archiver | use derivative of sql_pg_stat_archiver values to get archival rates |
| Postgres nb. of transactions | | sql_txid | derivative to get tps |
| System | | | |
| CPU usage | | node_cpu_seconds_total | use derivative for CPU usage |
| load average | | node_load{1,5,15} | |
| Memory usage | | node_memory_* | |
| Pending packages | | XXX | needs to be implemented with the textfile collector (see /usr/share/doc/prometheus-node-exporter/examples/text_collector_examples/apt.sh) |
| Swap in/out | | node_vmstat_pswpin; node_vmstat_pswpout | unit ?? probably absolute number of pages |
| Uptime | | time() - node_boot_time_seconds | |
| RabbitMQ | | | |
| use https://github.com/kbudde/rabbitmq_exporter or https://github.com/deadtrickster/prometheus_rabbitmq_exporter | | | |
| Consumers | | | |
| Memory used by queue | | | |
| Unacknowledged messages | | | |
| Nb. of connections | | | |
| Softwareheritage (prado) | | | |
| Almost everything | | integrate to sql-exporter configuration | |
| Most importantly Software Heritage Objects | | | |
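To make the "derivative" and "usage in percent" entries in the table concrete, here is a small sketch of the arithmetic on raw samples. The function names and sample values are hypothetical. In Prometheus itself, the usual way to take the per-second derivative of a counter is the PromQL `rate()` function; the 5-minute window in the example query is an arbitrary assumption.

```python
# Illustrative PromQL only; the [5m] window is an assumption, not a recommendation.
EXAMPLE_IOPS_QUERY = "rate(node_disk_reads_completed_total[5m])"


def counter_rate(prev_value, prev_ts, cur_value, cur_ts):
    """Per-second increase between two samples of a monotonic counter.

    Timestamps are in seconds. If the counter went backwards (e.g. after a
    reboot), the current value is taken as the increase since the reset.
    """
    elapsed = cur_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = cur_value - prev_value
    if delta < 0:
        delta = cur_value  # counter reset between the two samples
    return delta / elapsed


def disk_usage_ratio(size_bytes, avail_bytes):
    """Fraction of the filesystem in use: (size - avail) / size."""
    if size_bytes <= 0:
        raise ValueError("filesystem size must be positive")
    return (size_bytes - avail_bytes) / size_bytes
```

For instance, two samples of node_disk_reads_completed_total taken 60 seconds apart with values 100 and 400 give `counter_rate(100, 0, 400, 60)`, i.e. 5 reads per second. Passing node_filesystem_avail_bytes or node_filesystem_free_bytes as `avail_bytes` selects the non-root or root view of free space, as described in the table.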
Dec 6 2018
What kind of inventory do we want to do with this? Hardware? Software? Both?
Dec 5 2018
For the Mercurial loader, there is a module, swh.loader.mercurial.loader_verifier, which is not production code.
It's there to test the loader manually, so it could probably either be moved into the tests and turned into a proper test, or removed altogether.
Dec 4 2018
Disk
- I/Os per device
- Disk usage in percent
- Utilization per device: is this real? It could be useful to see if a storage subsystem is overloaded.
- Disk usage in absolute human values: percentages are meaningless if we resize filesystems.
Tip: after running Tox in a repo, run coverage report -m to show which lines are not covered.
Duplicate of T1421