Federate prometheus instances through thanos
Started, Work in Progress, High, Public

Description

Thanos [1] is the Swiss Army knife for Prometheus federation/HA/clustering.

It provides a global query view over multiple, potentially redundant, Prometheus data
stores: data from each Prometheus instance is pushed to a centralised object store, and
query frontends are then exposed on top of these stores.
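
As a rough illustration of what the "global view" means for a client, the sketch below builds an instant-query URL for a thanos query service. Thanos Query exposes a Prometheus-compatible HTTP API, and its `dedup` parameter asks it to merge series coming from redundant Prometheus instances. The host name and port are placeholders, not those of the actual deployment.

```python
from urllib.parse import urlencode


def thanos_query_url(base_url: str, promql: str, dedup: bool = True) -> str:
    """Build an instant-query URL for a thanos query service.

    The path is the Prometheus-compatible /api/v1/query endpoint; `dedup`
    is the Thanos-specific switch for replica deduplication.
    """
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and port: adjust to the real thanos query node.
    print(thanos_query_url("http://thanos.example.org:10902", "up"))
```

Because the API is Prometheus-compatible, a client such as Grafana only needs its datasource URL changed to point at the query service.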

Plan:

  • Install manual thanos services in mmca (temporary provenance server)
  • Push historical data from mmca to a thanos datastore bucket
  • Push historical data from pergamon to a thanos datastore bucket
  • D8089: Provision thanos query dedicated node (+ inventory update)
  • D8092: Expose a thanos query service to read from those datastores
  • D8097: Expose thanos gateway service to access historical data
  • Expose thanos gateway on mmca (historical data access) -> will run on the thanos node instead
  • D8097: Update thanos query to read from those gateways as well
  • Fix communication between thanos and pergamon nodes (firewall)
  • Fix communication between thanos and mmca nodes (certs)
  • D8143: Drop mmca's prometheus federation from puppet
  • mmca: drop history on the Prometheus server (/var/lib/prometheus/metrics2) [3]
  • mmca: Clean up historical data from bucket mmca-metrics-0 [3]
  • Switch grafana datasource from pergamon's prometheus to the thanos query service
  • Instantiate prometheus/thanos services in staging environment
  • Instantiate prometheus/thanos services in admin environment
  • Instantiate prometheus/thanos services in production environment
  • Instantiate prometheus/thanos services in azure environment
  • Federate them through thanos (a puppet run on the thanos node should add their gRPC entries)
  • Drop pergamon's prometheus
  • Document
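
The federation step above boils down to telling the thanos query service which gRPC endpoints (sidecars and store gateways) to fan out to. The following is only a minimal sketch of how such an argument list could be rendered. It assumes the `--store` flag of `thanos query` (newer Thanos releases favour `--endpoint`) and uses hypothetical host names; in the actual setup puppet generates these entries.

```python
from typing import Iterable, List


def thanos_query_args(grpc_endpoints: Iterable[str], flag: str = "--store") -> List[str]:
    """Render one flag per gRPC endpoint, sorted and de-duplicated."""
    unique = sorted(set(grpc_endpoints))
    return ["thanos", "query"] + [f"{flag}={endpoint}" for endpoint in unique]


if __name__ == "__main__":
    # Hypothetical endpoints; 10901 is only used here as an example gRPC port.
    endpoints = [
        "pergamon.example.org:10901",
        "mmca.example.org:10901",
        "pergamon.example.org:10901",  # duplicate entry is collapsed
    ]
    print(" ".join(thanos_query_args(endpoints)))
```

Sorting the endpoints keeps the rendered command line stable between runs, which avoids needless service restarts when a configuration management tool compares the old and new output.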

Draft notes can be found in the HedgeDoc document [2].

[1] https://thanos.io/

[2] https://hedgedoc.softwareheritage.org/X1henrmkT8yL6_W9R0YpGg?both

[3] A trial switch to the thanos query service showed doubled metrics, since pergamon
and mmca both hold the historical data. mmca's copy is no longer needed because pergamon
already has it through the old federation, so it can be dropped.
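
The doubling described in [3] can be illustrated with a small sketch: when two stores return the same series and differ only in a label identifying where the data came from, any aggregation over them counts each sample twice. The label sets below are made up, and the `source` label name is a hypothetical stand-in for whatever label distinguishes the two stores.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Labels = Dict[str, str]
SeriesKey = FrozenSet[Tuple[str, str]]


def find_duplicates(series: List[Labels], source_label: str = "source") -> Dict[SeriesKey, List[str]]:
    """Group series by their labels minus `source_label`.

    Returns only the groups served by more than one source, i.e. the
    series that would be counted several times without deduplication.
    """
    groups: Dict[SeriesKey, List[str]] = defaultdict(list)
    for labels in series:
        key = frozenset((k, v) for k, v in labels.items() if k != source_label)
        groups[key].append(labels.get(source_label, ""))
    return {key: sorted(sources) for key, sources in groups.items() if len(sources) > 1}


if __name__ == "__main__":
    # Made-up series: the first two are the same metric held by both stores.
    example = [
        {"__name__": "up", "job": "node", "source": "pergamon"},
        {"__name__": "up", "job": "node", "source": "mmca"},
        {"__name__": "up", "job": "postgres", "source": "pergamon"},
    ]
    for key, sources in find_duplicates(example).items():
        print(dict(key), "held by", sources)
```

Dropping one of the two copies, as planned for mmca, leaves every group with a single source.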

Event Timeline

ardumont triaged this task as High priority. (Jul 6 2022, 5:46 PM)
ardumont created this task.
ardumont updated the task description.
ardumont changed the task status from Open to Work in Progress. (Mon, Jul 11, 2:21 PM)
ardumont moved this task from Backlog to Weekly backlog on the System administration board.
ardumont moved this task from Weekly backlog to in-progress on the System administration board.
ardumont updated the task description.
ardumont updated the task description.