I'm pretty sure this is done now ;p
Jan 23 2020
Deployed.
Jan 22 2020
Adapting the Puppet manifest so we can distinguish issues per environment in Sentry.
Vault check deployed!
Deposit check deployed!
Debian-package this
Jan 20 2020
Debian-package this
Jan 17 2020
As far as I could tell so far:
- Debian-package this
- update puppet configuration to add the checks [1]
Jan 6 2020
I guess https://grafana.softwareheritage.org/d/Gyww7RfWz/workers-overview?orgId=1 implements this.
Dec 19 2019
Sentry is now available at https://sentry.softwareheritage.org/.
(marking as done since it was moved to the done column on the sprint board; please reopen if that's not OK)
Dec 11 2019
Resolved by D2424.
Dec 10 2019
Packaged and deployed the consumer group exporter on getty for both Kafka clusters.
Dec 7 2019
A quick test shows that https://github.com/braedon/prometheus-kafka-consumer-group-exporter does a decent job.
Dec 6 2019
I think I've mostly coerced Sentry, at https://sentry.softwareheritage.org/, into working. I used the opportunity to start refactoring the way Apache is handled in our Puppet environment, as well as slowly migrating some vhosts to Let's Encrypt.
Closed by D2394
Dec 3 2019
The new virtual machine for Sentry, [[ https://en.m.wikipedia.org/wiki/Riverside_Museum | riverside.internal.softwareheritage.org ]], has now been installed.
May 25 2019
Snapshot count is now there, closing.
Mar 26 2019
https://grafana.softwareheritage.org/d/jScG7g6mk/objstorage-object-counts shows the data that we're currently able to collect.
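For context, a minimal sketch of one cheap way to collect such per-object-type counts (an illustration only, not necessarily the actual collection mechanism: it assumes the objects are inventoried in a PostgreSQL database with the usual swh table names, and relies on planner estimates rather than exact counts):

select c.relname as object_type,
       c.reltuples::bigint as estimated_count  -- planner estimate, refreshed by (auto)vacuum/analyze
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where n.nspname = 'public'
  and c.relname in ('content', 'directory', 'revision',
                    'release', 'snapshot', 'origin');

An exact count(*) over tables this size would be prohibitively slow to run on a schedule; the reltuples estimate is close enough for a dashboard.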
Mar 20 2019
Already marked as done on 2018-12-19.
Mar 11 2019
This gargantuan query is now used on a Grafana dashboard: https://grafana.softwareheritage.org/d/-lJ73Ujiz/scheduler-task-status
Mar 8 2019
with task_count_by_bucket as (
    -- get the count of tasks by delay bucket. Tasks are grouped by their
    -- characteristics (type, status, policy, priority, current interval),
    -- then by delay buckets that are 1 hour wide between -24 and +24 hours,
    -- and 1 day wide outside of this range.
    -- A positive delay means the task execution is late wrt scheduling.
    select
        "type", status, "policy", priority, current_interval,
        (
            -- select the bucket widths
            case
                when delay between -24 * 3600 and 24 * 3600
                    then (ceil(delay / 3600)::bigint) * 3600
                else (ceil(delay / (24 * 3600))::bigint) * 24 * 3600
            end
        ) as delay_bucket,
        count(*)
    from task
    join lateral (
        -- this is where the "positive = late" convention is set
        select extract(epoch from (now() - next_run)) as delay
    ) as d on true
    group by "type", status, "policy", priority, current_interval, delay_bucket
    order by "type", status, "policy", priority, current_interval, delay_bucket
),
delay_bounds as (
    -- get the minimum and maximum delay bucket for each task group. This will
    -- let us generate all the buckets, even the empty ones in the next CTE.
    select
        "type", status, "policy", priority, current_interval,
        min(delay_bucket) as min,
        max(delay_bucket) as max
    from task_count_by_bucket
    group by "type", status, "policy", priority, current_interval
),
task_buckets as (
    -- Generate all time buckets for all categories.
    select
        "type", status, "policy", priority, current_interval, delay_bucket
    from delay_bounds
    join lateral (
        -- 1 hour buckets
        select generate_series(-23, 23) * 3600 as delay_bucket
        union
        -- 1 day buckets. The "- 1" is used to make sure we generate an empty
        -- bucket as lowest delay bucket, so prometheus quantile calculations
        -- stay accurate
        select generate_series(min / (24 * 3600) - 1, max / (24 * 3600)) * 24 * 3600
            as delay_bucket
    ) as buckets on true
),
task_count_for_all_buckets as (
    -- This join merges the non-empty buckets (task_count_by_bucket) with
    -- the full list of buckets (task_buckets).
    -- The join clause can't use the "using (x, y, z)" syntax, as it uses
    -- equality and priority and current_interval can be null. This also
    -- forces us to label all the fields in the select. Ugh.
    select
        task_buckets."type",
        task_buckets.status,
        task_buckets."policy",
        task_buckets.priority,
        task_buckets.current_interval,
        task_buckets.delay_bucket,
        coalesce(count, 0) as count  -- make sure empty buckets have a 0 count instead of null
    from task_buckets
    left join task_count_by_bucket
        on task_count_by_bucket."type" = task_buckets."type"
        and task_count_by_bucket.status = task_buckets.status
        and task_count_by_bucket."policy" = task_buckets."policy"
        and task_count_by_bucket.priority is not distinct from task_buckets.priority
        and task_count_by_bucket.current_interval
            is not distinct from task_buckets.current_interval
        and task_count_by_bucket.delay_bucket = task_buckets.delay_bucket
),
cumulative_buckets as (
    -- Prometheus wants cumulative histograms: for each bucket, the value
    -- needs to be the total of all measurements below the given value (this
    -- allows downsampling by just throwing away some buckets). We use the
    -- "sum over partition" window function to compute this.
    -- Prometheus also expects a "+Inf" bucket for the total count. We
    -- generate it with a null lt value so we can sort it after the rest of
    -- the buckets.
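    -- [The query as pasted breaks off at this point. What follows is only a
    -- hedged sketch of the rest, reconstructed from the comments above: a
    -- running sum per task group, plus a "+Inf" total bucket with a null
    -- bound. The "lt" alias and the final select are assumptions, not
    -- necessarily the production query.]
    select
        "type", status, "policy", priority, current_interval,
        delay_bucket as lt,
        sum(count) over (
            partition by "type", status, "policy", priority, current_interval
            order by delay_bucket
        ) as count
    from task_count_for_all_buckets
    union all
    -- the "+Inf" bucket: total per task group, with a null lt so it sorts
    -- after the finite buckets
    select
        "type", status, "policy", priority, current_interval,
        null::bigint as lt,
        sum(count) as count
    from task_count_for_all_buckets
    group by "type", status, "policy", priority, current_interval
)
select *
from cumulative_buckets
order by "type", status, "policy", priority, current_interval, lt nulls last;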