Page MenuHomeSoftware Heritage

Configure thanos query service to new dedicated node
ClosedPublic

Authored by ardumont on Jul 7 2022, 11:48 AM.

Details

Summary

This service will become the new datasource for grafana.

It will transparently query data to the multiple datastores (filled with prometheus
metrics).

Currently that means pergamon and mmca, more are on their way.

Related to T4385

Test Plan

vagrant up thanos:

  • puppet applies
  • check service is ok [1].
  • check query interface ui is also ok [2].

For [2], I guess the data exposed are the actual one from mmca (as we don't have any overridable setup for that node in puppet).

[1]

$ vagrant ssh thanos > sudo -i
root@thanos:~# ip a | grep 10.168
    inet 10.168.50.90/16 brd 10.168.255.255 scope global eth1
root@thanos:~# ss -tanp | grep 1919*
LISTEN   0      4096                       *:19191                   *:*     users:(("thanos",pid=24042,fd=8))         
ESTAB    0      0      [::ffff:10.168.50.90]:19191 [::ffff:10.168.0.1]:53606 users:(("thanos",pid=24042,fd=11))        
ESTAB    0      0      [::ffff:10.168.50.90]:19191 [::ffff:10.168.0.1]:53610 users:(("thanos",pid=24042,fd=13))        
ESTAB    0      0      [::ffff:10.168.50.90]:19191 [::ffff:10.168.0.1]:53608 users:(("thanos",pid=24042,fd=12))        
ESTAB    0      0      [::ffff:10.168.50.90]:19191 [::ffff:10.168.0.1]:53612 users:(("thanos",pid=24042,fd=14))        
root@thanos:~# systemctl status thanos-query | grep running
     Active: active (running) since Thu 2022-07-07 09:42:15 UTC; 4min 37s ago
root@thanos:~# systemctl cat thanos-query | grep -C3 "thanos query"
[Service]
Restart=on-failure
User=prometheus
ExecStart=/opt/thanos/current/thanos query --http-address 0.0.0.0:19191 \
    --store pergamon.internal.softwareheritage.org:19090 \
    --store mmca.softwareheritage.org:19090
ExecReload=/bin/kill -HUP $MAINPID

octo-diff on pergamon:

...
*******************************************
  File[/etc/systemd/system/thanos-sidecar.service] =>
   parameters =>
     content =>
      @@ -12,8 +12,8 @@
       ExecStart=/opt/thanos/current/thanos sidecar --tsdb.path=/var/lib/prometheus/metrics2 \
           --prometheus.url=http://192.168.100.29:9090/ \
      -    --objstore.config-file=/etc/thanos-sidecar/objstore.yml \
      +    --objstore.config-file=/etc/thanos/objstore.yml \
           --shipper.upload-compacted \
      -    --http-address=0.0.0.0:19191 \
      -    --grpc-address=0.0.0.0:19090
      +    --http-address=192.168.100.29:19191 \
      +    --grpc-address=192.168.100.29:19090
       ExecReload=/bin/kill -HUP $MAINPID
       TimeoutStopSec=20s
*******************************************
- File[/etc/thanos-sidecar/objstore.yml]
*******************************************
- File[/etc/thanos-sidecar]
*******************************************
+ File[/etc/thanos/objstore.yml] =>
   parameters =>
      "ensure": "present"
      "group": "prometheus"
      "mode": "0640"
      "owner": "root"
      "content": >>>
# File managed by puppet - modifications will be lost
type: AZURE
config:
  storage_account: swhthanosmetrics
  storage_account_key: ''
  container: metrics-historical-data-0
<<<
*******************************************
+ File[/etc/thanos] =>
   parameters =>
      "ensure": "directory"
      "group": "prometheus"
      "mode": "0750"
      "owner": "root"
*******************************************
+ Profile::Thanos::Export_query_endpoint[thanos-sidecar-pergamon.softwareheritage.org] =>
   parameters =>
      "grpc_address": "192.168.100.29:19090"
*******************************************
  Systemd::Unit_file[thanos-sidecar.service] =>
   parameters =>
     content =>
      @@ -12,8 +12,8 @@
       ExecStart=/opt/thanos/current/thanos sidecar --tsdb.path=/var/lib/prometheus/metrics2 \
           --prometheus.url=http://192.168.100.29:9090/ \
      -    --objstore.config-file=/etc/thanos-sidecar/objstore.yml \
      +    --objstore.config-file=/etc/thanos/objstore.yml \
           --shipper.upload-compacted \
      -    --http-address=0.0.0.0:19191 \
      -    --grpc-address=0.0.0.0:19090
      +    --http-address=192.168.100.29:19191 \
      +    --grpc-address=192.168.100.29:19090
       ExecReload=/bin/kill -HUP $MAINPID
       TimeoutStopSec=20s
*******************************************
*** End octocatalog-diff on pergamon.softwareheritage.org

Diff Detail

Repository
rSPSITE puppet-swh-site
Branch
staging
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 30398
Build 47520: arc lint + arc unit

Event Timeline

ardumont edited the test plan for this revision. (Show Details)
ardumont added inline comments.
site-modules/profile/manifests/thanos/base.pp
43

Got a warning with vagrant saying that variable is not defined.

site-modules/profile/manifests/thanos/query.pp
16

used this as we can't use a hashmap here, same 'store' key for multiple store.

olasd added inline comments.
site-modules/profile/manifests/thanos/prometheus_sidecar.pp
12–14

These variable names should be specific to the sidecar.

33–35

While we're updating these, we should probably make them listen to the internal address instead of all addreses.

site-modules/profile/manifests/thanos/query.pp
8

This port should be specific to the query service.

9–10

I think the stores variable should be full grpc targets (host + port) rather than separate the variable. For instance, the mmca target will have a different port (443) than all others.

12–16

Overall I think this should be using the file_sd mechanism rather than a hardcoded list of entries in the CLI: generate a yaml file with a list of targets on the fly.

Among other things, this allows updating the list of stores with just a reload of the service rather than a full restart.

16

We could update flatten_to_argument_list to turn arrays into multiple instance of the same cli flag.

site-modules/profile/templates/thanos/thanos-query.service.erb
9

I believe this could use DynamicUser=yes, as it'll only need to read a couple of files.

site-modules/role/manifests/swh_thanos_query.pp
2 ↗(On Diff #29203)

As this will eventually have both the query and store gateways, I guess it could just be called role::swh_thanos.

site-modules/profile/manifests/thanos/query.pp
12–16

And, eventually, these could be pulled from exported resources (the same way we manage prometheus targets)

ardumont added inline comments.
site-modules/profile/manifests/thanos/query.pp
12–16

ack, i'll see what i can do about this:

  • thanos {gateway, sidecar} services as exported resources
  • collect those resources from the "thanos" node
  • derive the file_sd configuration file out of those resources
  • adapt the thanos query cli to use that file ^
ardumont marked 3 inline comments as done and 2 inline comments as done.Jul 7 2022, 3:20 PM

Address the simple part of the comments (had a bit of a fight with vagrant):

  • sidecar: dedicate http and grpc port
  • sidecar: expose only internal ip
  • query: dedicate http and grpc port
  • systemd service: use dynamic user
  • stores list: use full $service:$port string

wip: Rework logic to instantiate thanos query to lookup datastore (out of collected
resources and statically declared resources)

site-modules/profile/manifests/thanos/query.pp
47

Having a hard time determining how to read the collected resource to properly extend the config_filepath.

Use collected resources to generate the configuration file for thanos query service

site-modules/profile/manifests/thanos/prometheus_sidecar.pp
67–77

This one should be a concrete resource on the profile::thanos::query end. If you make it an exported resource, once we have multiple hosts with a sidecar the thanos host will get multiple identical definitions which will be a problem.

78–84

This is fine, but it's a bit leaky in terms of implementation details.

To be a bit more future-proof, we could have a pair of defined resources:

  • a profile::thanos::query_endpoint resource, which would contain the actual concat fragment
  • a profile::thanos::export_query_endpoint resource, which would contain an exported @@profile::thanos::query_endpoint.

This would also avoid having to pull an unrelated $config_filepath variable in the prometheus sidecar class.

The object store gateways would then also be able to export query endpoints (to another host) or define query endpoints (on the local host).

This way, we can change the way we handle the collection of query endpoints without having to touch the multiple classes which will define some.

(You can have a look at the two defined resources profile::prometheus::{scrape_config,export_scrape_config} for another instance of this pattern)

Adapt according to suggestion

Stop duplication of config-dir creation (+ Separation of concern)

Looks good to me, thanks!

This revision is now accepted and ready to land.Jul 12 2022, 10:11 AM
site-modules/profile/templates/thanos/thanos-query.service.erb
9

i did that and then reverted it but i don't remember why.