Jan 5 2021
In T1414#55819, @vsellier wrote: Can this task be closed since the subject was addressed in T2620 ?
Can this task be closed since the subject was addressed in T2620 ?
Nov 17 2020
anlambert closed T1768: Add end to end tests for the frontend part of swh-web, a subtask of T1411: reach a minimum of 80% SLOC coverage across all components, as Resolved.
Sep 22 2020
olasd closed T1435: Improve swh-scheduler prometheus metrics, a subtask of T1408: More/better Metrics, as Resolved.
We've definitely improved on this (notably by using proper hostnames for the instance label on Prometheus metrics). I think we should make this task more actionable if we want to keep it open.
olasd closed T1438: Add labels to prometheus metrics to help queries, a subtask of T1408: More/better Metrics, as Resolved.
Sep 16 2020
ardumont closed T1386: Refactor indexers' initialization step, a subtask of T1410: Kill implicit configuration: new configuration scheme, as Wontfix.
Sep 8 2020
Sep 4 2020
Wikimedia is using Netbox as the source of truth for their infrastructure, and Puppet configures the facts from it. It's not exactly the same use case as ours, since we would like to have Netbox automatically provisioned.
and their documentation: https://wikitech.wikimedia.org/wiki/Netbox
A docker-compose setup is available to easily test Netbox: https://github.com/netbox-community/netbox-docker
This is the Puppet configuration used at Wikimedia: https://gerrit.wikimedia.org/r/c/operations/puppet/+/387880/
Feb 11 2020
Netbox looks pretty nice as a full hardware/device inventory tool: https://netbox.readthedocs.io/en/stable/
Nov 27 2019
Puppet changes added in 17b2b3041212aca9e0a9a35c510885de7bb78230.
Ideally the Debian package should now be added to the Software Heritage private repository.
Nov 26 2019
ardumont closed T1425: refactor the loader stack for package managers, a subtask of T1418: Loaders, as Resolved.
ardumont closed T1389: Implement a base "package" loader for package managers, a subtask of T1425: refactor the loader stack for package managers, as Resolved.
Nov 25 2019
Instructions to create Debian packages have been added in D2352.
Nov 24 2019
AFAIU from last week's work, Munin is now gone.
Nov 19 2019
ftigeot changed the status of T1556: Document hardware architecture, a subtask of T1407: Internal documentation (meta task), from Open to Work in Progress.
Nov 8 2019
ftigeot closed T1653: Prometheus rate functions considered unreliable, a subtask of T1356: Kill munin, as Wontfix.
No relevant problem has been reported with our dataset/usage of Prometheus. Closing.
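For context on why rate functions can look unreliable in the first place, here is an illustrative Python sketch (not Prometheus source code, and not SWH tooling) of the basic idea behind a per-second rate over counter samples, including the counter-reset handling that naive deltas get wrong. The sample values are made up.

```python
def counter_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns the average per-second rate. A decrease between samples is
    treated as a counter reset, i.e. the counter restarted from zero."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # on a reset, the whole new value counts as increase since zero
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# 20 units, then a reset followed by 20 more units, over 30 seconds
print(counter_rate([(0, 100), (15, 120), (30, 20)]))
```

A naive `last - first` delta over the same samples would be negative; the reset handling is what keeps the rate meaningful.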
Nov 6 2019
ftigeot closed T1442: Replace Munin graphs with Grafana/Prometheus dashboards, a subtask of T1356: Kill munin, as Resolved.
I do not see any missing piece in the Grafana dashboards; the Munin graph service/VM can be shut down.
Any chance we can close this now?
Oct 1 2019
ardumont changed the status of T1389: Implement a base "package" loader for package managers, a subtask of T1425: refactor the loader stack for package managers, from Open to Work in Progress.
Sep 6 2019
ardumont updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
ardumont updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
Sep 5 2019
ardumont updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
ardumont updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
Aug 29 2019
anlambert updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
ardumont updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
ardumont updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
ardumont updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
Aug 6 2019
anlambert updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
Aug 1 2019
ardumont renamed T1413: swh-docker-dev: Refactor/improve provisionning step from Refactor/improve provisionning step to swh-docker-dev: Refactor/improve provisionning step.
Jul 16 2019
ftigeot closed T1355: Move the object counter from munin to prometheus, a subtask of T1356: Kill munin, as Resolved.
Jul 11 2019
zack updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
Jun 12 2019
The most recent update of the state of this task has shown a regression in the journal test coverage, which, per se, is not a big deal (just a few points). But it does raise the question of how, once we have attained whatever "minimum" coverage we are OK with, we monitor over time that there is no regression. For instance, I think that code reviews should show the reviewers how the submitted diff affects code coverage. Ideally, reviewers should be able to see if it has a net positive or negative effect on coverage, and take that into account in their review decisions. (Which is not to say we should never accept diffs that decrease code coverage: there might be reasons to do so. But it is a data point that would be useful for reviewers to see.)
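To make the idea concrete, here is a hypothetical Python sketch (no such SWH review tool is implied by this task) of the kind of coverage-delta signal a code review could surface; the function name, numbers and tolerance are made up for illustration.

```python
def coverage_delta(base_pct, diff_pct, tolerance=0.0):
    """Compare coverage before and after a diff; return a verdict string.
    A drop larger than the tolerance is flagged as a decrease."""
    delta = diff_pct - base_pct
    if delta < -tolerance:
        return f"coverage decreased by {-delta:.1f} points"
    return f"coverage change: {delta:+.1f} points"

print(coverage_delta(80.0, 78.5))  # a regression of a few points
print(coverage_delta(80.0, 81.0))  # a net positive change
```

The verdict string is the data point discussed above: it does not block the diff, it just makes the coverage effect visible to reviewers.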
zack updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
Jun 6 2019
May 25 2019
zack renamed T1411: reach a minimum of 80% SLOC coverage across all components from at least 80% SLOC coverage in all components to reach a minimum of 80% SLOC coverage across all components.
only 3% to go in -lister and -core \o/
zack updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
these catch-all meta-tasks that will grow forever are not terribly useful; the individual tasks + their subtasks should be enough
May 13 2019
The Grafana dashboards are stored in the PostgreSQL database on pergamon, which is backed up through the full system backups.
olasd closed T1698: Make sure Grafana dashboards are backed up, a subtask of T1442: Replace Munin graphs with Grafana/Prometheus dashboards, as Resolved.
May 3 2019
There was a config/deployment bug on both the hg and svn loaders. Both bugs have been fixed and the tasks are running fine now.
Reopening this, as the first submitted "save code now" tasks for hg and svn did not get executed so far (see [1]).
Nevertheless, they have been scheduled, so it looks like some extra worker configuration is needed in production.
May 2 2019
Thanks... looks like the tasks have been properly scheduled, but they have not been executed... some more polishing may be needed.
@rdicosmo, the possibility to submit hg and svn origin types through the "Save code now" form has been deployed to production [1].
I have submitted one origin of each type to save. Let's see if the underlying scheduler tasks get correctly executed before spreading the news to the wild.
anlambert closed T1419: hg/svn support in save code now as Resolved by committing rDWAPPS04b06d85c494: templates/origin-save.html: hg and svn origin types can now be saved.
we (@anlambert and I) will try to have this task closed ASAP (like today, if no big bad stopper arises in front of us)
Apr 30 2019
ftigeot changed the status of T1697: Deploy Grafanalib-based dashboards with Puppet, a subtask of T1442: Replace Munin graphs with Grafana/Prometheus dashboards, from Open to Work in Progress.
ftigeot changed the status of T1697: Deploy Grafanalib-based dashboards with Puppet from Open to Work in Progress.
Grafanalib dashboards added to https://grafana.softwareheritage.org/ via the new provisioning mechanism of Grafana 5.x.
Fully automated provisioning is still a work-in-progress.
Prometheus does not provide storage device statistics for Proxmox container-based hosts.
The data can be read from their parent machine dashboards though.
Apr 19 2019
Are there any blockers left? It would be really nice to roll this out in the very near future.
Apr 18 2019
If we remove Munin before implementing the missing graph replacements, we will lack a comparison base and possibly fail to discover bogus data.
Right now, Prometheus disk throughput and IOPS values are suspiciously low, for example.
Apr 16 2019
Even though most/all of the Munin metrics are provided by Prometheus, Munin also provides graphs.
It is these graphs we are still missing.
Indeed, it is the object of T1428. That's why I am a bit puzzled that the work you have in progress does not simply target T1356. I was expecting some response to this very task in your grafanalib-based code, which I did not find. So I was wondering if I missed something, i.e. that some data were still in Munin only.
Wasn't that what T1428 was about ?
Apart from the list of pending packages, all commonly used Munin metrics should already have Prometheus equivalents.
In T1442#30575, @ftigeot wrote: When I asked where to put such work-in-progress, you suggested the snippets repository.
When I asked where to put such work-in-progress, you suggested the snippets repository.
In T1442#30564, @ftigeot wrote: Work-in-progress Grafanalib dashboards have been added to the https://forge.softwareheritage.org/source/snippets/ repository.
Apr 15 2019
Work-in-progress Grafanalib dashboards have been added to the https://forge.softwareheritage.org/source/snippets/ repository.
Apr 13 2019
(typo) nothing interesting, moving along.
ardumont closed T1459: docker container for swh-deposit, a subtask of T1443: Make swh services run within docker and docker-compose, as Resolved.
Apr 12 2019
ardumont added a parent task for T1459: docker container for swh-deposit: T1581: Deposit: improvements.
Related D1411
Related rCDFD77f4b2e0617be57282e5ab4a972f7a643768e668
Related rCDFDf68bb33b1493d7ef46471e3162abd03e6d6b0021
Related rCDFD2f87f477046fc82b7ce4b6fede712c051e943c14
Apr 2 2019
anlambert closed T1379: npm loader, a subtask of T1425: refactor the loader stack for package managers, as Resolved.
Mar 26 2019
ardumont updated the task description for T1411: reach a minimum of 80% SLOC coverage across all components.
Mar 25 2019
douardda closed T1405: Make it easy to run a complete swh instance, a subtask of T1413: swh-docker-dev: Refactor/improve provisionning step, as Resolved.
Let's call it done, even if the small dataset part has not been addressed.
douardda closed T1443: Make swh services run within docker and docker-compose, a subtask of T1405: Make it easy to run a complete swh instance, as Resolved.
Let's call it done; some minor parts may still need a bit of attention though.
douardda updated the task description for T1443: Make swh services run within docker and docker-compose.
Consider this done, even if it remains a background task.
Mar 20 2019
ftigeot closed T1428: Create an inventory of useful Munin metrics, a subtask of T1408: More/better Metrics, as Resolved.
Already marked as done on 2018-12-19.
ftigeot closed T1428: Create an inventory of useful Munin metrics, a subtask of T1356: Kill munin, as Resolved.
Mar 11 2019
This gargantuan query is now used on a Grafana dashboard: https://grafana.softwareheritage.org/d/-lJ73Ujiz/scheduler-task-status
Mar 8 2019
with task_count_by_bucket as (
  -- get the count of tasks by delay bucket. Tasks are grouped by their
  -- characteristics (type, status, policy, priority, current interval),
  -- then by delay buckets that are 1 hour wide between -24 and +24 hours,
  -- and 1 day wide outside of this range.
  -- A positive delay means the task execution is late wrt scheduling.
  select
    "type", status, "policy", priority, current_interval,
    (
      -- select the bucket widths
      case
        when delay between - 24 * 3600 and 24 * 3600
          then (ceil(delay / 3600)::bigint) * 3600
        else (ceil(delay / (24 * 3600))::bigint) * 24 * 3600
      end
    ) as delay_bucket,
    count(*)
  from task
  join lateral (
    -- this is where the "positive = late" convention is set
    select extract(epoch from (now() - next_run)) as delay
  ) as d on true
  group by "type", status, "policy", priority, current_interval, delay_bucket
  order by "type", status, "policy", priority, current_interval, delay_bucket
),
delay_bounds as (
  -- get the minimum and maximum delay bucket for each task group. This will
  -- let us generate all the buckets, even the empty ones in the next CTE.
  select
    "type", status, "policy", priority, current_interval,
    min(delay_bucket) as min,
    max(delay_bucket) as max
  from task_count_by_bucket
  group by "type", status, "policy", priority, current_interval
),
task_buckets as (
  -- Generate all time buckets for all categories.
  select
    "type", status, "policy", priority, current_interval, delay_bucket
  from delay_bounds
  join lateral (
    -- 1 hour buckets
    select generate_series(- 23, 23) * 3600 as delay_bucket
    union
    -- 1 day buckets. The "- 1" is used to make sure we generate an empty
    -- bucket as lowest delay bucket, so prometheus quantile calculations
    -- stay accurate
    select generate_series(min / (24 * 3600) - 1, max / (24 * 3600)) * 24 * 3600
      as delay_bucket
  ) as buckets on true
),
task_count_for_all_buckets as (
  -- This join merges the non-empty buckets (task_count_by_bucket) with
  -- the full list of buckets (task_buckets).
  -- The join clause can't use the "using (x, y, z)" syntax, as it uses
  -- equality and priority and current_interval can be null. This also
  -- forces us to label all the fields in the select. Ugh.
  select
    task_buckets."type", task_buckets.status, task_buckets."policy",
    task_buckets.priority, task_buckets.current_interval,
    task_buckets.delay_bucket,
    coalesce(count, 0) as count  -- make sure empty buckets have a 0 count instead of null
  from task_buckets
  left join task_count_by_bucket
    on task_count_by_bucket."type" = task_buckets."type"
    and task_count_by_bucket.status = task_buckets.status
    and task_count_by_bucket."policy" = task_buckets."policy"
    and task_count_by_bucket.priority is not distinct from task_buckets.priority
    and task_count_by_bucket.current_interval is not distinct from task_buckets.current_interval
    and task_count_by_bucket.delay_bucket = task_buckets.delay_bucket
),
cumulative_buckets as (
  -- Prometheus wants cumulative histograms: for each bucket, the value
  -- needs to be the total of all measurements below the given value (this
  -- allows downsampling by just throwing away some buckets). We use the
  -- "sum over partition" window function to compute this.
  -- Prometheus also expects a "+Inf" bucket for the total count. We
  -- generate it with a null lt value so we can sort it after the rest of
  -- the buckets.
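As background for the cumulative_buckets step, here is a minimal Python sketch (not the query itself) of the cumulative histogram convention Prometheus expects: each bucket holds the total count of all measurements at or below its bound, plus a final "+Inf" bucket carrying the grand total. Bucket bounds and counts are made up for illustration.

```python
def cumulative_histogram(buckets):
    """buckets: list of (upper_bound_seconds, count), sorted by bound.
    Returns (le, cumulative_count) pairs ending with the '+Inf' bucket."""
    out = []
    total = 0
    for le, count in buckets:
        total += count  # each bucket accumulates everything below its bound
        out.append((str(le), total))
    out.append(("+Inf", total))  # the +Inf bucket carries the total count
    return out

# three delay buckets: 1 hour, 2 hours, 1 day
print(cumulative_histogram([(3600, 5), (7200, 3), (86400, 2)]))
```

This cumulative shape is what allows downsampling by simply throwing away buckets, as the SQL comments note.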
Mar 6 2019
Full Prometheus histogram-compatible query:
Mar 5 2019
Current status of the SQL query:
Feb 28 2019
@douardda I wish to work on this patch. Could you please explain what I have to do here?
vlorentz renamed T1407: Internal documentation (meta task) from Internal documentation to Internal documentation (meta task).
Feb 27 2019
\o/
Feb 26 2019
All storage mocks are now removed in every swh module, so closing this.
anlambert closed T1307: Remove mock storages used in tests., a subtask of T1421: drop swh-storage mocking everywhere, as Resolved.