Page MenuHomeSoftware Heritage

Send scheduler metrics to prometheus
ClosedPublic

Authored by olasd on Sep 3 2021, 9:34 AM.

Details

Summary

Output sample of the query used within can be seen in a paste [1]

This activates the scheduler metrics on the staging scheduler db as well (in a dedicated commit).

[1] P1149

Related to T3538

Test Plan
  • bin/octo-diff on belvedere and db1.staging shows the new metrics expectedly:
$ bin/octocatalog-diff --octocatalog-diff-args --no-truncate-details --to staging belvedere
*******************************************
  File[/etc/prometheus-sql-exporter/swh-scheduler.yml] =>
   parameters =>
     content =>
      @@ -2,5 +2,5 @@
         scope: database
         cluster: secondary
      -  database: ^softwareheritage-scheduler$
      +  database: ^(swh|softwareheritage)-scheduler$
         interval: '1h'
         help: "Software Heritage Scheduler task delay spread. Positive delay for tasks whose execution is late"
      @@ -194,2 +194,26 @@
         values:
           - sum
      +
      +- name: swh_scheduler
      +  scope: database
      +  database: ^(softwareheritage|swh)-scheduler$
      +  interval: '15m'
      +  help: "Software Heritage Scheduler Metrics"
      +  query: |
      +    select l.name, l.instance_name, sm.visit_type,
      +          extract(epoch from sm.last_update) as last_update,
      +           sm.origins_known, sm.origins_enabled, sm.origins_never_visited,
      +           sm.origins_with_pending_changes
      +    from scheduler_metrics sm
      +      inner join listers l on sm.lister_id=l.id
      +    order by l.name, l.instance_name
      +  labels:
      +    - name
      +    - instance_name
      +    - visit_type
      +  values:
      +    - last_update
      +    - origins_known
      +    - origins_enabled
      +    - origins_never_visited
      +    - origins_with_pending_changes
*******************************************
*** End octocatalog-diff on belvedere.internal.softwareheritage.org
$ bin/octocatalog-diff --octocatalog-diff-args --no-truncate-details --to staging db1.internal.staging.swh.network
...
*******************************************
+ File[/etc/prometheus-sql-exporter/swh-scheduler.yml] =>
   parameters =>
      "ensure": "present"
      "group": "root"
      "mode": "0644"
      "notify": "Service[prometheus-sql-exporter]"
      "owner": "root"
      "content": >>>
- name: swh_scheduler_delay
  scope: database
  cluster: secondary
  database: ^(swh|softwareheritage)-scheduler$
  interval: '1h'
  help: "Software Heritage Scheduler task delay spread. Positive delay for tasks whose execution is late"
  query: |
    with task_count_by_bucket as (
      -- get the count of tasks by delay bucket. Tasks are grouped by their
      -- characteristics (type, status, policy, priority, current interval),
      -- then by delay buckets that are 1 hour wide between -24 and +24 hours,
      -- and 1 day wide outside of this range.
      -- A positive delay means the task execution is late wrt scheduling.
      select
        "type",
        status,
        "policy",
        priority,
        current_interval,
        (
          -- select the bucket widths
          case when delay between - 24 * 3600 and 24 * 3600 then
            (ceil(delay / 3600)::bigint) * 3600
          else
            (ceil(delay / (24 * 3600))::bigint) * 24 * 3600
          end
        ) as delay_bucket,
        count(*)
      from
        task
        join lateral (
          -- this is where the "positive = late" convention is set
          select
            extract(epoch from (now() - next_run)) as delay
        ) as d on true
        group by
          "type",
          status,
          "policy",
          priority,
          current_interval,
          delay_bucket
        order by
          "type",
          status,
          "policy",
          priority,
          current_interval,
          delay_bucket
    ),
    delay_bounds as (
      -- get the minimum and maximum delay bucket for each task group. This will
      -- let us generate all the buckets, even the empty ones in the next CTE.
      select
        "type",
        status,
        "policy",
        priority,
        current_interval,
        min(delay_bucket) as min,
        max(delay_bucket) as max
      from
        task_count_by_bucket
      group by
        "type",
        status,
        "policy",
        priority,
        current_interval
    ),
    task_buckets as (
      -- Generate all time buckets for all categories.
      select
        "type",
        status,
        "policy",
        priority,
        current_interval,
        delay_bucket
      from
        delay_bounds
        join lateral (
          -- 1 hour buckets
          select
            generate_series(- 23, 23) * 3600 as delay_bucket
          union
          -- 1 day buckets. The "- 1" is used to make sure we generate an empty
          -- bucket as lowest delay bucket, so prometheus quantile calculations
          -- stay accurate
          select
            generate_series(min / (24 * 3600) - 1, max / (24 * 3600)) * 24 * 3600 as delay_bucket
        ) as buckets on true
    ),
    task_count_for_all_buckets as (
      -- This join merges the non-empty buckets (task_count_by_bucket) with
      -- the full list of buckets (task_buckets).
      -- The join clause can't use the "using (x, y, z)" syntax, as it uses
      -- equality and priority and current_interval can be null. This also
      -- forces us to label all the fields in the select. Ugh.
      select
        task_buckets."type",
        task_buckets.status,
        task_buckets."policy",
        task_buckets.priority,
        task_buckets.current_interval,
        task_buckets.delay_bucket,
        coalesce(count, 0) as count -- make sure empty buckets have a 0 count instead of null
      from
        task_buckets
      left join task_count_by_bucket
        on task_count_by_bucket."type" = task_buckets."type"
        and task_count_by_bucket.status = task_buckets.status
        and task_count_by_bucket. "policy" = task_buckets."policy"
        and task_count_by_bucket.priority is not distinct from task_buckets.priority
        and task_count_by_bucket.current_interval is not distinct from task_buckets.current_interval
        and task_count_by_bucket.delay_bucket = task_buckets.delay_bucket
    ),
    cumulative_buckets as (
      -- Prometheus wants cumulative histograms: for each bucket, the value
      -- needs to be the total of all measurements below the given value (this
      -- allows downsampling by just throwing away some buckets). We use the
      -- "sum over partition" window function to compute this.
      -- Prometheus also expects a "+Inf" bucket for the total count. We
      -- generate it with a null le value so we can sort it after the rest of
      -- the buckets.

      -- cumulative data
      select
        "type",
        status,
        "policy",
        priority,
        current_interval,
        delay_bucket as le,
        sum(count) over (
          partition by
            "type",
            status,
            "policy",
            priority,
            current_interval
          order by
            delay_bucket
        )
      from
        task_count_for_all_buckets
      union all
      -- +Inf data
      select
        "type",
        status,
        "policy",
        priority,
        current_interval,
        null as le,
        sum(count)
      from
        task_count_for_all_buckets
      group by
        "type",
        status,
        "policy",
        priority,
        current_interval
      -- sorting of all buckets
      order by
        "type",
        status,
        "policy",
        priority,
        current_interval,
        le asc NULLS last -- make sure +Inf ends up last
    )
    -- The final query, which at this point just has to make sure that all
    -- labels are text (or the SQL exporter croaks)
    select
      -- we retrieve the backend name here as that's what we have e.g. on the celery side
      (select backend_name from task_type where cumulative_buckets."type" = task_type."type") as task,
      status::text as status,
      policy::text as policy,
      coalesce(priority::text, '') as priority,
      coalesce(current_interval::text, '') as current_interval,
      coalesce(le::text, '+Inf') as le,
      sum
    from
      cumulative_buckets
  labels:
    - task
    - status
    - policy
    - priority
    - current_interval
    - le
  values:
    - sum

- name: swh_scheduler
  scope: database
  database: ^(softwareheritage|swh)-scheduler$
  interval: '15m'
  help: "Software Heritage Scheduler Metrics"
  query: |
    select l.name, l.instance_name, sm.visit_type,
          extract(epoch from sm.last_update) as last_update,
           sm.origins_known, sm.origins_enabled, sm.origins_never_visited,
           sm.origins_with_pending_changes
    from scheduler_metrics sm
      inner join listers l on sm.lister_id=l.id
    order by l.name, l.instance_name
  labels:
    - name
    - instance_name
    - visit_type
  values:
    - last_update
    - origins_known
    - origins_enabled
    - origins_never_visited
    - origins_with_pending_changes
<<<
*******************************************
+ File[/etc/prometheus-sql-exporter/swh-storage.yml] =>
   parameters =>
      "ensure": "present"
      "group": "root"
      "mode": "0644"
      "notify": "Service[prometheus-sql-exporter]"
      "owner": "root"
      "content": >>>
- name: swh_archive_object_count
  help: Software Heritage Archive object counters
  scope: database
  cluster: main
  database: softwareheritage
  labels:
    - object_type
  values:
    - value
  query: >-
    select label as object_type, value from swh_stat_counters()
<<<
*******************************************
*** End octocatalog-diff on db1.internal.staging.swh.network

Diff Detail

Repository
rSENV Puppet Environment
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 23344
Build 36425: arc lint + arc unit

Event Timeline

ardumont created this revision.

We should keep lister_name and lister_instance as separate labels. Labels don't cost anything as they're stored once per time series (and it would allow us to generate aggregated metrics for all instances of a given lister).

However, origins_known, origins_enabled, origins_never_visited, origins_with_pending_changes should be values, not labels. In the same vein, we should extract a unix timestamp from last_update and make it a metric / value, rather than have it as a label (which would change on every run).

I think prometheus-sql-exporter builds the actual metric name as {name}_{name of value}. So it would generate swh_scheduler_metrics_last_update, swh_scheduler_metrics_origins_known, … I think we can drop the metrics from that name, which is redundant.

  • Adapt according to review:
    • Distinguish correctly between labels and values
    • Make last_update a timestamp
    • Drop '_metrics' from the name as prometheus will add it itself
    • Keep name and instance_name as distinguished labels
  • db1.staging: Activate metrics for staging scheduler db
  • Drop secondary cluster as it's unnecessary a filter
  • Make the db name a regexp to be compatible both with prod and staging dbs
ardumont edited the test plan for this revision. (Show Details)
ardumont edited the test plan for this revision. (Show Details)
site-modules/profile/files/prometheus/sql/config/swh-scheduler.yml
3 ↗(On Diff #22380)

You'll need to drop that one too

site-modules/profile/files/prometheus/sql/config/swh-scheduler.yml
3 ↗(On Diff #22380)

d'oh!

ardumont edited the test plan for this revision. (Show Details)

Drop the other secondary cluster instruction

Keep the _metrics in the metric name

No need for the _metrics suffix

olasd added a reviewer: ardumont.

Rebase (so the commits are pushed to the staging area which got configured recently)

Add metrics relabeling to prometheus-sql-exporter, for scheduler metrics

site-modules/profile/files/prometheus/update-prometheus-config
95–96 ↗(On Diff #22547)

here is the print i meant on irc.

This revision is now accepted and ready to land.Sep 9 2021, 5:50 PM
ardumont retitled this revision from [wip] Send scheduler metrics to prometheus to Send scheduler metrics to prometheus.Sep 9 2021, 5:50 PM
ardumont added inline comments.
site-modules/profile/files/prometheus/update-prometheus-config
95–96 ↗(On Diff #22547)

well the if must stay, but you grok the idea ;)