Federate prometheus instances through thanos
Closed, Migrated

Description

Thanos [1] is the swiss-army knife for prometheus federation/HA/clustering.

It allows querying a global view of multiple, potentially redundant, prometheus data
stores, by pushing data from prometheus instances to centralised object stores, then
providing query frontends for each of these data stores.
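
As a rough illustration of the "global view", the sketch below builds an instant query against a thanos query frontend, which serves the same HTTP API as prometheus (`/api/v1/query`). The host `thanos.example.org`, the port, and the helper names are placeholders for the example, not the actual deployment.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint (assumption); replace with the real thanos query address.
THANOS_QUERY_URL = "http://thanos.example.org:10902"


def build_query_url(base_url: str, promql: str, dedup: bool = True) -> str:
    """Build an instant-query URL for the prometheus-compatible HTTP API."""
    params = urllib.parse.urlencode(
        {"query": promql, "dedup": str(dedup).lower()}
    )
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def instant_query(base_url: str, promql: str, timeout: float = 10.0) -> list:
    """Run the query and return the list of result series."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]
```

Calling `instant_query(THANOS_QUERY_URL, "up")` would return the `up` series from every store the query service reads from, assuming the endpoint is reachable.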

Plan:

  • Install manual thanos services in mmca (temporary provenance server)
  • Push historical data from mmca to a thanos datastore bucket
  • Push historical data from pergamon to a thanos datastore bucket
  • D8089: Provision thanos query dedicated node (+ inventory update)
  • D8092: Expose a thanos query service to read from those datastores
  • D8097: Expose thanos gateway service to access historical data

- [ ] Expose thanos gateway on mmca (historical data access) -> will run on the thanos node instead

  • D8097: Update thanos query to read from those gateways as well
  • Fix communication between thanos and pergamon nodes (firewall)
  • Fix communication between thanos and mmca nodes (certs)
  • D8143: Drop mmca's prometheus federation from puppet
  • mmca: drop history on prometheus server (/var/lib/prometheus/metrics2) [3]
  • mmca: Clean up historical data from bucket mmca-metrics-0 [3]
  • Switch grafana datasource from pergamon's prometheus to the thanos query service
  • Instantiate thanos sidecar service in staging cluster (then reference it to thanos node)

- [ ] Instantiate prometheus/thanos services in staging environment (no longer needed since T4540)

  • Instantiate prometheus/thanos services in production environment
  • Instantiate prometheus/thanos services in admin environment
  • Instantiate prometheus/thanos services in azure environment
  • Instantiate prometheus/thanos services in gitlab environment
  • Instantiate prometheus/thanos services in rancher environment
  • Federate it through thanos (puppet run on thanos node should add their grpc entries)
  • Drop pergamon's prometheus
  • Document

Draft notes can be found in the hedgedoc document [2].

[1] https://thanos.io/

[2] https://hedgedoc.softwareheritage.org/X1henrmkT8yL6_W9R0YpGg?both

[3] A trial switch to the thanos query service showed that metrics were doubled, since
pergamon and mmca both hold the historical data. mmca's copy is no longer needed, since
pergamon already has it through the old federation, so it can be dropped.
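
The doubling described in [3] can be pictured with a toy model (this is not thanos code): the same historical series served by two stores under different identifying labels shows up as two distinct series. The label name `replica` and the sample series are made up for the example. The task resolves the issue by dropping mmca's copy; ignoring a label is shown here only to make the cause visible.

```python
from typing import Dict, FrozenSet, List, Tuple

Labels = Dict[str, str]
SeriesKey = Tuple[Tuple[str, str], ...]


def series_key(labels: Labels, ignore: FrozenSet[str] = frozenset()) -> SeriesKey:
    """Identify a series by its sorted label pairs, minus ignored labels."""
    return tuple(sorted((k, v) for k, v in labels.items() if k not in ignore))


def merge(stores: List[List[Labels]], ignore: FrozenSet[str] = frozenset()) -> List[SeriesKey]:
    """Return the distinct series seen across all stores."""
    seen = set()
    for store in stores:
        for labels in store:
            seen.add(series_key(labels, ignore))
    return sorted(seen)


# Hypothetical data: both stores hold the same historical series.
pergamon = [{"__name__": "up", "instance": "host1", "replica": "pergamon"}]
mmca = [{"__name__": "up", "instance": "host1", "replica": "mmca"}]

print(len(merge([pergamon, mmca])))                         # 2: series counted twice
print(len(merge([pergamon, mmca], frozenset({"replica"}))))  # 1: copies collapse
```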

Event Timeline

ardumont triaged this task as High priority. Jul 6 2022, 5:46 PM
ardumont created this task.
ardumont updated the task description.
ardumont changed the task status from Open to Work in Progress. Jul 11 2022, 2:21 PM
ardumont moved this task from Backlog to Weekly backlog on the System administration board.
ardumont moved this task from Weekly backlog to in-progress on the System administration board.

thanos exposed on the production cluster with this commit: rSPRE8fade05553ed4a01e54e1b8481150c0e055e3f34