Apr 21 2020

olasd closed T1270: Investigate an application monitoring tool to automate error detection in our workers as Resolved.

I'm pretty sure this is done now ;p

Apr 21 2020, 11:36 AM · Metrics/monitoring, Development environment

Feb 15 2020

vlorentz moved T2175: Deploy swh-icinga-plugins from Backlog to deployed on the Sprint 2019/12 (Monitor and Conquer) board.
Feb 15 2020, 8:18 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
vlorentz moved T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services from Backlog to deployed on the Sprint 2019/12 (Monitor and Conquer) board.
Feb 15 2020, 8:18 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration

Jan 27 2020

vlorentz added a comment to T1365: Archive coverage metrics in prometheus.

https://grafana.softwareheritage.org/d/3SAW_JEmk/software-heritage-archive-counters

Jan 27 2020, 4:44 PM · Metrics/monitoring, Restricted Project
vlorentz closed T1365: Archive coverage metrics in prometheus, a subtask of T1364: Have production metrics in prometheus or kibana, as Resolved.
Jan 27 2020, 4:44 PM · Metrics/monitoring, Restricted Project
vlorentz closed T1365: Archive coverage metrics in prometheus as Resolved.
Jan 27 2020, 4:44 PM · Metrics/monitoring, Restricted Project

Jan 23 2020

ardumont closed T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services as Resolved.

Deployed.

Jan 23 2020, 12:09 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
ardumont added a parent task for T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services: T2238: Configure Sentry environments.
Jan 23 2020, 11:13 AM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration

Jan 22 2020

ardumont added a revision to T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services: D2576: sentry: Define setup for swh services (servers, workers, ...).
Jan 22 2020, 6:50 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
vlorentz added a project to T2228: Metrics and monitoring: Metrics/monitoring.
Jan 22 2020, 4:27 PM · Metrics/monitoring, Roadmap 2020
ardumont claimed T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services.

Adapting the puppet manifest so we can discriminate issues per environment in sentry.

Jan 22 2020, 4:13 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
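For illustration, here is a minimal sketch of how a service might read the two variables named in T2181 at startup. The feed does not say how the services consume them, so the `SentrySettings` class, the `read_sentry_settings` helper, and the `SWH_SENTRY_DSN` variable name are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SentrySettings:
    """Sentry-related settings gathered from the process environment."""

    dsn: Optional[str]
    environment: Optional[str]
    main_package: Optional[str]


def read_sentry_settings(
    environ: Optional[Mapping[str, str]] = None,
) -> SentrySettings:
    """Collect Sentry settings from environment variables.

    SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT are the names from T2181;
    SWH_SENTRY_DSN is an assumed name used only for this sketch.
    Unset or empty variables are reported as None.
    """
    env = os.environ if environ is None else environ
    return SentrySettings(
        dsn=env.get("SWH_SENTRY_DSN") or None,  # hypothetical variable name
        environment=env.get("SWH_SENTRY_ENVIRONMENT") or None,
        main_package=env.get("SWH_MAIN_PACKAGE") or None,
    )
```

With sentry-sdk, `dsn` and `environment` could then be passed to `sentry_sdk.init(dsn=..., environment=...)`, so that issues from, say, staging and production can be filtered separately. How `main_package` is used is not stated here.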
ardumont closed T2175: Deploy swh-icinga-plugins, a subtask of T1011: Enable continuous monitoring of deposit, as Resolved.
Jan 22 2020, 3:29 PM · Metrics/monitoring, SWORD deposit
ardumont closed T2175: Deploy swh-icinga-plugins as Resolved.
Jan 22 2020, 3:29 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
ardumont added a comment to T2175: Deploy swh-icinga-plugins.

Vault check deployed!

Jan 22 2020, 3:28 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
ardumont added a comment to T2175: Deploy swh-icinga-plugins.

Deposit check deployed!

Jan 22 2020, 2:12 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
ardumont added a comment to T2175: Deploy swh-icinga-plugins.

debian package this

Jan 22 2020, 2:12 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
vlorentz updated the task description for T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services.
Jan 22 2020, 2:11 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
vlorentz renamed T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services from Set SWH_MAIN_PACKAGE for all services to Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services.
Jan 22 2020, 2:10 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration

Jan 20 2020

ardumont added a comment to T2175: Deploy swh-icinga-plugins.

debian package this

Jan 20 2020, 12:04 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring

Jan 17 2020

ardumont claimed T2175: Deploy swh-icinga-plugins.

As far as I could tell so far:

  • debian package this
  • update puppet configuration to add the checks [1]
Jan 17 2020, 5:56 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring

Jan 15 2020

vlorentz renamed T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services from Set SWH_MAIN_PACKAGE for all SWH services to Set SWH_MAIN_PACKAGE for all services.
Jan 15 2020, 2:59 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
vlorentz triaged T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services as Normal priority.
Jan 15 2020, 2:59 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
vlorentz updated subscribers of T2180: Configure Jenkins to publish releases to Sentry.
Jan 15 2020, 2:58 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring
vlorentz created T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services.
Jan 15 2020, 2:58 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
vlorentz updated the task description for T2180: Configure Jenkins to publish releases to Sentry.
Jan 15 2020, 2:56 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring
vlorentz triaged T2180: Configure Jenkins to publish releases to Sentry as Normal priority.
Jan 15 2020, 2:56 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring
vlorentz added a project to T2175: Deploy swh-icinga-plugins: Sprint 2019/12 (Monitor and Conquer).
Jan 15 2020, 1:37 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring

Jan 13 2020

vlorentz closed T2118: Deposit: End to End monitoring, a subtask of T2175: Deploy swh-icinga-plugins, as Resolved.
Jan 13 2020, 3:24 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
vlorentz closed T2118: Deposit: End to End monitoring, a subtask of T1011: Enable continuous monitoring of deposit, as Resolved.
Jan 13 2020, 3:24 PM · Metrics/monitoring, SWORD deposit
vlorentz closed T2126: Production Vault end to end testing, a subtask of T2175: Deploy swh-icinga-plugins, as Resolved.
Jan 13 2020, 3:24 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
vlorentz added subtasks for T2175: Deploy swh-icinga-plugins: T2118: Deposit: End to End monitoring, T2126: Production Vault end to end testing.
Jan 13 2020, 3:23 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
vlorentz triaged T2175: Deploy swh-icinga-plugins as Normal priority.
Jan 13 2020, 3:23 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring

Jan 6 2020

olasd closed T1202: swh services: Monitor swh-worker@.service's status as Resolved.

I guess https://grafana.softwareheritage.org/d/Gyww7RfWz/workers-overview?orgId=1 implements this.

Jan 6 2020, 4:28 PM · Metrics/monitoring, System administration

Dec 19 2019

olasd moved T2133: Scheduler listener/runner: add statsd probes from done to deployed on the Sprint 2019/12 (Monitor and Conquer) board.
Dec 19 2019, 2:07 PM · Metrics/monitoring, Scheduling utilities, Sprint 2019/12 (Monitor and Conquer)
olasd moved T1359: Add sentry support in every swh running service from done to deployed on the Sprint 2019/12 (Monitor and Conquer) board.
Dec 19 2019, 2:06 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
olasd moved T1358: Setup a sentry service from done to deployed on the Sprint 2019/12 (Monitor and Conquer) board.
Dec 19 2019, 2:06 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
olasd moved T1358: Setup a sentry service from in progress to done on the Sprint 2019/12 (Monitor and Conquer) board.
Dec 19 2019, 2:06 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
olasd closed T1358: Setup a sentry service as Resolved.

Sentry is now available at https://sentry.softwareheritage.org/.

Dec 19 2019, 10:19 AM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
zack closed T1359: Add sentry support in every swh running service as Resolved.

(marking as done as it was moved to the done column on the sprint board, please reopen if not ok)

Dec 19 2019, 10:06 AM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
zack closed T1359: Add sentry support in every swh running service, a subtask of T1358: Setup a sentry service, as Resolved.
Dec 19 2019, 10:06 AM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration

Dec 16 2019

vlorentz changed the status of T2118: Deposit: End to End monitoring, a subtask of T1011: Enable continuous monitoring of deposit, from Open to Work in Progress.
Dec 16 2019, 4:09 PM · Metrics/monitoring, SWORD deposit
vlorentz moved T1359: Add sentry support in every swh running service from in progress to done on the Sprint 2019/12 (Monitor and Conquer) board.
Dec 16 2019, 3:57 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration

Dec 11 2019

vlorentz closed T2142: Document how to use Sentry with the docker dev environment, a subtask of T1359: Add sentry support in every swh running service, as Resolved.
Dec 11 2019, 3:42 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
vlorentz closed T2142: Document how to use Sentry with the docker dev environment as Resolved.
Dec 11 2019, 3:42 PM · Docker environment, Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring
vlorentz added revisions to T1359: Add sentry support in every swh running service: D2428: Add sentry integration to the JS code, D2426: Initialize Sentry on Celery worker startup, D2423: Add sentry integration to swh-web, D2411: Make the CLI initialize sentry-sdk based on CLI options/envvars, D2418: Add gunicorn config script to initialize sentry-sdk based on envvars, D2420: Import gunicorn config from swh-core.
Dec 11 2019, 3:41 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
vlorentz claimed T1359: Add sentry support in every swh running service.
Dec 11 2019, 3:40 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
vlorentz moved T1359: Add sentry support in every swh running service from Backlog to in progress on the Sprint 2019/12 (Monitor and Conquer) board.
Dec 11 2019, 3:40 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
vlorentz moved T2142: Document how to use Sentry with the docker dev environment from in progress to deployed on the Sprint 2019/12 (Monitor and Conquer) board.
Dec 11 2019, 3:40 PM · Docker environment, Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring
vlorentz added a comment to T2142: Document how to use Sentry with the docker dev environment.

Resolved by D2424.

Dec 11 2019, 3:39 PM · Docker environment, Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring

Dec 10 2019

olasd added a comment to T2128: Monitor journal consumer lag.

Packaged and deployed the consumer group exporter on getty for both kafka clusters.

Dec 10 2019, 8:10 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer)

Dec 9 2019

vlorentz renamed T2142: Document how to use Sentry with the docker dev environment from Add sentry to the docker-dev environment to Document how to use Sentry with the docker dev environment.
Dec 9 2019, 12:43 PM · Docker environment, Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring
vlorentz added a project to T2142: Document how to use Sentry with the docker dev environment: Docker environment.
Dec 9 2019, 10:33 AM · Docker environment, Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring
vlorentz moved T2142: Document how to use Sentry with the docker dev environment from Backlog to in progress on the Sprint 2019/12 (Monitor and Conquer) board.
Dec 9 2019, 10:28 AM · Docker environment, Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring
vlorentz moved T1358: Setup a sentry service from Backlog to in progress on the Sprint 2019/12 (Monitor and Conquer) board.
Dec 9 2019, 10:28 AM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
vlorentz changed the status of T2142: Document how to use Sentry with the docker dev environment from Open to Work in Progress.
Dec 9 2019, 10:28 AM · Docker environment, Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring
vlorentz changed the status of T2142: Document how to use Sentry with the docker dev environment, a subtask of T1359: Add sentry support in every swh running service, from Open to Work in Progress.
Dec 9 2019, 10:28 AM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
vlorentz triaged T2142: Document how to use Sentry with the docker dev environment as Normal priority.
Dec 9 2019, 10:27 AM · Docker environment, Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring

Dec 7 2019

olasd merged task T1361: Push rabbitmq metrics to Prometheus into T2130: Scheduler monitoring: probe rabbitmq status.
Dec 7 2019, 6:22 PM · Metrics/monitoring, System administration
olasd added a comment to T2128: Monitor journal consumer lag.

A quick test shows that https://github.com/braedon/prometheus-kafka-consumer-group-exporter does a decent job.

Dec 7 2019, 6:21 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer)
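As background for T2128: consumer lag for a partition is the distance between the partition's log end offset and the offset the consumer group has committed. The sketch below computes it from two plain dictionaries. It does not reproduce the exporter's code; the function name and the input shapes are assumptions made for illustration.

```python
from typing import Dict


def consumer_lag(
    end_offsets: Dict[int, int],
    committed_offsets: Dict[int, int],
) -> Dict[int, int]:
    """Return the per-partition lag of a consumer group.

    end_offsets maps partition -> log end offset.
    committed_offsets maps partition -> last committed offset of the group.
    Assumption for this sketch: a partition with no committed offset is
    treated as if the group had committed offset 0.
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero in case the two snapshots were taken at
        # slightly different times.
        lag[partition] = max(end - committed, 0)
    return lag
```

Summing the values gives the total lag of the group on a topic, which is the kind of figure one would typically graph or alert on.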

Dec 6 2019

olasd changed the status of T1358: Setup a sentry service from Open to Work in Progress.

I think I've mostly coerced sentry, at url https://sentry.softwareheritage.org/, into working. I used the opportunity to start refactoring the way apache is handled in our puppet environment, as well as slowly migrating some vhosts to Let's Encrypt.

Dec 6 2019, 11:06 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
zack moved T2133: Scheduler listener/runner: add statsd probes from Backlog to done on the Sprint 2019/12 (Monitor and Conquer) board.
Dec 6 2019, 12:30 PM · Metrics/monitoring, Scheduling utilities, Sprint 2019/12 (Monitor and Conquer)
douardda closed T2133: Scheduler listener/runner: add statsd probes as Resolved.

Closed by D2394

Dec 6 2019, 10:22 AM · Metrics/monitoring, Scheduling utilities, Sprint 2019/12 (Monitor and Conquer)

Dec 4 2019

zack moved T1360: Install a sentry server from Backlog to deployed on the Sprint 2019/12 (Monitor and Conquer) board.
Dec 4 2019, 3:37 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
douardda added a revision to T2133: Scheduler listener/runner: add statsd probes: D2394: celery: add 2 statsd probes for the runner and listener.
Dec 4 2019, 10:58 AM · Metrics/monitoring, Scheduling utilities, Sprint 2019/12 (Monitor and Conquer)

Dec 3 2019

olasd closed T1360: Install a sentry server, a subtask of T1358: Setup a sentry service, as Resolved.
Dec 3 2019, 6:09 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
olasd closed T1360: Install a sentry server as Resolved.

The new virtual machine for sentry, [[ https://en.m.wikipedia.org/wiki/Riverside_Museum | riverside.internal.softwareheritage.org ]], has now been installed.

Dec 3 2019, 6:09 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
vlorentz lowered the priority of T1359: Add sentry support in every swh running service from High to Normal.
Dec 3 2019, 5:50 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
vlorentz lowered the priority of T2128: Monitor journal consumer lag from High to Normal.
Dec 3 2019, 5:50 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer)
vlorentz lowered the priority of T2133: Scheduler listener/runner: add statsd probes from High to Normal.
Dec 3 2019, 5:50 PM · Metrics/monitoring, Scheduling utilities, Sprint 2019/12 (Monitor and Conquer)
vlorentz added a project to T2128: Monitor journal consumer lag: Metrics/monitoring.
Dec 3 2019, 3:23 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer)
vlorentz added a project to T2133: Scheduler listener/runner: add statsd probes: Metrics/monitoring.
Dec 3 2019, 3:19 PM · Metrics/monitoring, Scheduling utilities, Sprint 2019/12 (Monitor and Conquer)
vlorentz raised the priority of T1359: Add sentry support in every swh running service from Normal to High.
Dec 3 2019, 3:17 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
vlorentz raised the priority of T1360: Install a sentry server from Normal to High.
Dec 3 2019, 3:17 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
vlorentz raised the priority of T1358: Setup a sentry service from Normal to High.
Dec 3 2019, 3:17 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration

Dec 2 2019

moranegg added a subtask for T1011: Enable continuous monitoring of deposit: T2118: Deposit: End to End monitoring.
Dec 2 2019, 4:16 PM · Metrics/monitoring, SWORD deposit
olasd added a project to T1360: Install a sentry server: Sprint 2019/12 (Monitor and Conquer).
Dec 2 2019, 2:33 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
olasd added a project to T1359: Add sentry support in every swh running service: Sprint 2019/12 (Monitor and Conquer).
Dec 2 2019, 2:33 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
olasd added a project to T1358: Setup a sentry service: Sprint 2019/12 (Monitor and Conquer).
Dec 2 2019, 2:33 PM · Sprint 2019/12 (Monitor and Conquer), Metrics/monitoring, System administration
moranegg updated the task description for T1011: Enable continuous monitoring of deposit.
Dec 2 2019, 12:21 PM · Metrics/monitoring, SWORD deposit

Nov 14 2019

moranegg triaged T2087: Create script to test SWORD deposit on SWH as Normal priority.
Nov 14 2019, 10:45 AM · Metrics/monitoring, SWORD deposit
moranegg triaged T2086: create test script to deposit software on HAL as Normal priority.
Nov 14 2019, 10:35 AM · Metrics/monitoring, SWORD deposit
moranegg triaged T2085: Automate integration test from HAL to SWH as Normal priority.
Nov 14 2019, 10:33 AM · Metrics/monitoring, SWORD deposit

Nov 5 2019

moranegg triaged T2058: Specify an automated approach for status page for the deposit as Normal priority.
Nov 5 2019, 10:55 AM · Metrics/monitoring, SWORD deposit
moranegg updated the task description for T1011: Enable continuous monitoring of deposit.
Nov 5 2019, 10:54 AM · Metrics/monitoring, SWORD deposit

Oct 30 2019

vlorentz added a project to T1009: Create email notifications for deposit errors: Metrics/monitoring.
Oct 30 2019, 12:27 PM · Metrics/monitoring, SWORD deposit

Sep 6 2019

ardumont updated the task description for T1202: swh services: Monitor swh-worker@.service's status.
Sep 6 2019, 2:14 PM · Metrics/monitoring, System administration

May 25 2019

zack closed T1180: add munin monitoring of snapshot objects count as Resolved.

snapshot count is now there, closing

May 25 2019, 5:25 PM · Metrics/monitoring, System administration
zack added a project to T1362: Upgrade the Prometheus setup to Thanos: System administration.
May 25 2019, 5:10 PM · System administration, Metrics/monitoring

May 17 2019

anlambert closed T1490: Use origin url on external-id attribute on deposit admin page, a subtask of T1011: Enable continuous monitoring of deposit, as Resolved.
May 17 2019, 2:27 PM · Metrics/monitoring, SWORD deposit
anlambert closed T1490: Use origin url on external-id attribute on deposit admin page as Resolved by committing rDWAPPS7a671af936f5: admin/deposit: Improve reporting graphical interface.
May 17 2019, 2:27 PM · Metrics/monitoring, SWORD deposit

May 16 2019

anlambert claimed T1490: Use origin url on external-id attribute on deposit admin page.
May 16 2019, 2:31 PM · Metrics/monitoring, SWORD deposit

Apr 10 2019

vlorentz moved T1607: Graph object count per object storage backend from Backlog to in progress on the Sprint 2019 03 board.
Apr 10 2019, 10:52 AM · Metrics/monitoring, Sprint 2019 03
vlorentz moved T1621: Add metrics to the currently deployed kafka cluster from Backlog to in progress on the Sprint 2019 03 board.
Apr 10 2019, 10:52 AM · System administration, Metrics/monitoring, Sprint 2019 03

Apr 2 2019

zack added projects to T1621: Add metrics to the currently deployed kafka cluster: Metrics/monitoring, System administration.
Apr 2 2019, 2:43 PM · System administration, Metrics/monitoring, Sprint 2019 03

Mar 26 2019

olasd changed the status of T1607: Graph object count per object storage backend from Open to Work in Progress.

https://grafana.softwareheritage.org/d/jScG7g6mk/objstorage-object-counts shows the data that we're currently able to collect.

Mar 26 2019, 6:52 PM · Metrics/monitoring, Sprint 2019 03
olasd triaged T1607: Graph object count per object storage backend as High priority.
Mar 26 2019, 4:31 PM · Metrics/monitoring, Sprint 2019 03

Mar 20 2019

ftigeot closed T1428: Create an inventory of useful Munin metrics, a subtask of T1408: More/better Metrics, as Resolved.
Mar 20 2019, 11:43 AM · Metrics/monitoring, Sprint 2018 12
ftigeot closed T1428: Create an inventory of useful Munin metrics as Resolved.

Already marked as done on 2018-12-19.

Mar 20 2019, 11:43 AM · Metrics/monitoring, Sprint 2018 12

Mar 11 2019

olasd added a comment to T1435: Improve swh-scheduler prometheus metrics.

This gargantuan query is now used on a grafana dashboard: https://grafana.softwareheritage.org/d/-lJ73Ujiz/scheduler-task-status

Mar 11 2019, 5:58 PM · Metrics/monitoring, Sprint 2018 12

Mar 8 2019

olasd added a comment to T1435: Improve swh-scheduler prometheus metrics.
with task_count_by_bucket as (
  -- get the count of tasks by delay bucket. Tasks are grouped by their
  -- characteristics (type, status, policy, priority, current interval),
  -- then by delay buckets that are 1 hour wide between -24 and +24 hours,
  -- and 1 day wide outside of this range.
  -- A positive delay means the task execution is late wrt scheduling.
  select
    "type",
    status,
    "policy",
    priority,
    current_interval,
    (
      -- select the bucket widths
      case when delay between - 24 * 3600 and 24 * 3600 then
        (ceil(delay / 3600)::bigint) * 3600
      else
        (ceil(delay / (24 * 3600))::bigint) * 24 * 3600
      end
    ) as delay_bucket,
    count(*)
  from
    task
    join lateral (
      -- this is where the "positive = late" convention is set
      select
        extract(epoch from (now() - next_run)) as delay
    ) as d on true
    group by
      "type",
      status,
      "policy",
      priority,
      current_interval,
      delay_bucket
    order by
      "type",
      status,
      "policy",
      priority,
      current_interval,
      delay_bucket
),
delay_bounds as (
  -- get the minimum and maximum delay bucket for each task group. This will
  -- let us generate all the buckets, even the empty ones in the next CTE.
  select
    "type",
    status,
    "policy",
    priority,
    current_interval,
    min(delay_bucket) as min,
    max(delay_bucket) as max
  from
    task_count_by_bucket
  group by
    "type",
    status,
    "policy",
    priority,
    current_interval
),
task_buckets as (
  -- Generate all time buckets for all categories.
  select
    "type",
    status,
    "policy",
    priority,
    current_interval,
    delay_bucket
  from
    delay_bounds
    join lateral (
      -- 1 hour buckets
      select
        generate_series(- 23, 23) * 3600 as delay_bucket
      union
      -- 1 day buckets. The "- 1" is used to make sure we generate an empty
      -- bucket as lowest delay bucket, so prometheus quantile calculations
      -- stay accurate
      select
        generate_series(min / (24 * 3600) - 1, max / (24 * 3600)) * 24 * 3600 as delay_bucket
    ) as buckets on true
),
task_count_for_all_buckets as (
  -- This join merges the non-empty buckets (task_count_by_bucket) with
  -- the full list of buckets (task_buckets).
  -- The join clause can't use the "using (x, y, z)" syntax, as it uses
  -- equality and priority and current_interval can be null. This also
  -- forces us to label all the fields in the select. Ugh.
  select
    task_buckets."type",
    task_buckets.status,
    task_buckets."policy",
    task_buckets.priority,
    task_buckets.current_interval,
    task_buckets.delay_bucket,
    coalesce(count, 0) as count -- make sure empty buckets have a 0 count instead of null
  from
    task_buckets
  left join task_count_by_bucket
    on task_count_by_bucket."type" = task_buckets."type"
    and task_count_by_bucket.status = task_buckets.status
    and task_count_by_bucket."policy" = task_buckets."policy"
    and task_count_by_bucket.priority is not distinct from task_buckets.priority
    and task_count_by_bucket.current_interval is not distinct from task_buckets.current_interval
    and task_count_by_bucket.delay_bucket = task_buckets.delay_bucket
),
cumulative_buckets as (
  -- Prometheus wants cumulative histograms: for each bucket, the value
  -- needs to be the total of all measurements below the given value (this
  -- allows downsampling by just throwing away some buckets). We use the
  -- "sum over partition" window function to compute this.
  -- Prometheus also expects a "+Inf" bucket for the total count. We
  -- generate it with a null lt value so we can sort it after the rest of
  -- the buckets.
Mar 8 2019, 12:12 PM · Metrics/monitoring, Sprint 2018 12
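The query above is cut off in this feed, so the following is only a rough Python sketch of the two ideas its comments describe: the delay bucketing rule (1 hour wide within ±24 hours, 1 day wide outside) and the cumulative counts Prometheus expects, ending with a "+Inf" bucket. The function names are hypothetical. Unlike the SQL, this sketch only emits non-empty buckets and does not group by task characteristics.

```python
import math
from collections import Counter
from itertools import accumulate
from typing import Iterable, List, Tuple

HOUR = 3600
DAY = 24 * HOUR


def delay_bucket(delay: float) -> int:
    """Upper bound, in seconds, of the bucket that contains `delay`.

    Mirrors the CASE expression in the query: hourly buckets between
    -24h and +24h (inclusive), daily buckets outside that range.
    A positive delay means the task is late.
    """
    if -DAY <= delay <= DAY:
        return math.ceil(delay / HOUR) * HOUR
    return math.ceil(delay / DAY) * DAY


def cumulative_histogram(delays: Iterable[float]) -> List[Tuple[float, int]]:
    """Return (upper_bound, cumulative_count) pairs, sorted by bound.

    Each count is the number of delays at or below the bound; the last
    entry is the +Inf bucket holding the total count.
    """
    delays = list(delays)
    counts = Counter(delay_bucket(d) for d in delays)
    bounds = sorted(counts)
    running = accumulate(counts[b] for b in bounds)
    histogram = [(float(b), total) for b, total in zip(bounds, running)]
    histogram.append((math.inf, len(delays)))
    return histogram
```

For example, `cumulative_histogram([10, 20, 4000, 90000])` places two tasks in the 3600 s bucket, one in the 7200 s bucket, and one in the 172800 s (2 day) bucket, then accumulates the counts upward.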