
Dec 3 2021

ardumont moved T1481: add metric to monitor "save code now" efficiency from deployed/landed/monitoring to Backlog on the System administration board.
Dec 3 2021, 3:57 PM · Save Code Now, System administration, Metrics/monitoring

Aug 26 2021

olasd merged T1278: swh-journal: the monitoring tool question! into T2128: Monitor journal consumer lag.
Aug 26 2021, 12:30 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer)
olasd added a comment to T2128: Monitor journal consumer lag.

This would have caught T3502 earlier too.

Aug 26 2021, 12:27 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer)

Aug 3 2021

ardumont added a comment to T3127: Compute and display distribution of origins by forge.

The computation of those metrics will be executed in production on a regular basis, probably each day, to keep them up to date.

Aug 3 2021, 5:00 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
ardumont added a revision to T3127: Compute and display distribution of origins by forge: D6052: Install update-metrics as a service called daily.
Aug 3 2021, 2:32 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task

Jul 29 2021

ardumont changed the status of T3402: Deploy swh-counters v0.8.0 and backfill origins, a subtask of T3127: Compute and display distribution of origins by forge, from Wontfix to Resolved.
Jul 29 2021, 1:24 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
ardumont changed the status of T3402: Deploy swh-counters v0.8.0 and backfill origins from Wontfix to Resolved.
Jul 29 2021, 1:24 PM · Counters, System administration, Metrics/monitoring

Jul 23 2021

anlambert added a comment to T3127: Compute and display distribution of origins by forge.
In T3127#67581, @anlambert wrote:

    I am a bit puzzled by the numbers shown: do we really have only 200k origins for GitLab.com?

Indeed, there is something weird here, as we have more than one million gitlab.com origins in the database.

softwareheritage=> select count(*) from origin where url like 'https://gitlab.com/%';
  count  
---------
 1023499
(1 row)

It looks like something was missed when computing lister metrics from the scheduler database; this needs further investigation.

Indeed, please do look into this, thanks.

Jul 23 2021, 12:17 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task

Jul 22 2021

anlambert added a comment to T3127: Compute and display distribution of origins by forge.

Thanks for these details. This count is missing the 800k git origins; @ardumont and @olasd should be able to tell you how to find them.

Jul 22 2021, 12:29 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
rdicosmo added a comment to T3127: Compute and display distribution of origins by forge.

I am a bit puzzled by the numbers shown: do we really have only 200k origins for GitLab.com?

Indeed, there is something weird here, as we have more than one million gitlab.com origins in the database.

softwareheritage=> select count(*) from origin where url like 'https://gitlab.com/%';
  count  
---------
 1023499
(1 row)

It looks like something was missed when computing lister metrics from the scheduler database; this needs further investigation.

Jul 22 2021, 9:01 AM · Metrics/monitoring, Web app, Roadmap 2021, meta-task

Jul 21 2021

anlambert added a comment to T3127: Compute and display distribution of origins by forge.

I am a bit puzzled by the numbers shown: do we really have only 200k origins for GitLab.com?

Jul 21 2021, 5:26 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
rdicosmo added a comment to T3127: Compute and display distribution of origins by forge.

I am a bit puzzled by the numbers shown: do we really have only 200k origins for GitLab.com?
And we know we had some 1.5M origins for Google Code; why are only 700k shown here?

Jul 21 2021, 3:40 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert added a comment to T3127: Compute and display distribution of origins by forge.

Instead, we could split the coverage widget into two tabs

  • one giving a high level overview of the archived origins, similar to what we have now with logos and counters
  • one giving the details of all forges we archived so far, displayed in a table as you suggested with relevant metrics and links to search origins for a given forge
Jul 21 2021, 3:23 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task

Jul 19 2021

anlambert added a revision to T3127: Compute and display distribution of origins by forge: D6007: common/utils: Wrap deposits list retrieval in a function.
Jul 19 2021, 5:29 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert added a comment to T3127: Compute and display distribution of origins by forge.

I think we could also get an accurate count of deposit origins (HAL, IPOL) using swh-deposit API

Jul 19 2021, 3:54 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task

Jul 16 2021

anlambert added a comment to T3127: Compute and display distribution of origins by forge.

Only one nit about the display. Using modal windows/popovers means there will be no easy way for a user to see the full list: one will have to click on each logo one by one, which could be quite annoying. Would it be possible to have a page with a rendering of the table above? (Not sure if we want all columns, but at least the last update time and the number of origins per forge instance look relevant and interesting to me.) It could be either in addition to what you propose (e.g., as a "coverage details" link, leading to the full page), or as a replacement for it (e.g., by making each forge icon just a link to the relevant anchor within the table on the "coverage details" page).

Jul 16 2021, 11:43 AM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
zack added a comment to T3127: Compute and display distribution of origins by forge.

Thanks for this update, great work!

Jul 16 2021, 11:29 AM · Metrics/monitoring, Web app, Roadmap 2021, meta-task

Jul 13 2021

anlambert added a comment to T3127: Compute and display distribution of origins by forge.

A report on what has been done so far, and some future directions regarding the display of this data in swh-web.

Jul 13 2021, 3:39 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task

Jul 9 2021

olasd changed the status of T3403: Use forge URL network location as default lister instance name, a subtask of T3127: Compute and display distribution of origins by forge, from Open to Work in Progress.
Jul 9 2021, 3:37 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert closed T3402: Deploy swh-counters v0.8.0 and backfill origins, a subtask of T3127: Compute and display distribution of origins by forge, as Wontfix.
Jul 9 2021, 2:34 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert closed T3402: Deploy swh-counters v0.8.0 and backfill origins as Wontfix.

Precise metrics about listed origins and their counts will be retrieved from the scheduler database, no need to backfill origins with swh-counters then, closing this.

Jul 9 2021, 2:34 PM · Counters, System administration, Metrics/monitoring

Jun 23 2021

olasd added a comment to T3127: Compute and display distribution of origins by forge.

As @olasd said in a previous comment, even if we compute the metrics, we will miss counters about origins not tied to a lister
(googlecode and gitorious for instance). So I am thinking again about a hybrid approach using the swh-counters metrics
implemented yesterday, which give a rough estimate of the number of origins by network location (as visit statuses are not
processed, only origins), together with the scheduler metrics.

Jun 23 2021, 9:16 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert added a comment to T3127: Compute and display distribution of origins by forge.

I guess the CLI to update metrics is executed periodically in production?

I don't think that they are yet but that just got a priority increase now ;)

Jun 23 2021, 2:08 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
ardumont added a comment to T3127: Compute and display distribution of origins by forge.

I guess the CLI to update metrics is executed periodically in production?

Jun 23 2021, 1:59 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert added a comment to T3127: Compute and display distribution of origins by forge.

The existing scheduler metrics are probably not complete enough for all we want to display (we should review them so they are), but the swh.scheduler journal client already gathers all the information needed, so we should be able to compute all that we need from the scheduler tables.

Jun 23 2021, 12:49 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert added a comment to T3127: Compute and display distribution of origins by forge.

After more thoughts about all those metrics, we could revamp the coverage widget into two tabs:

  • one tab displaying metrics about loaded origins with detailed counts by forge and links to search interface to browse them
  • one tab displaying metrics about listed origins from the data extracted from the scheduler database
Jun 23 2021, 12:13 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert added a comment to T3127: Compute and display distribution of origins by forge.
Jun 23 2021, 12:05 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert added a comment to T3127: Compute and display distribution of origins by forge.

@anlambert @rdicosmo

For information, while discussing with @olasd, he reminded me that we already had a CLI entrypoint [1]
to compute the stats we want on the scheduler side.

What's missing implementation-wise would be to expose an endpoint to actually display said information.

So the question is: even though the swh.counter implementation has started, do we really want it there,
or would this ^ on the scheduler side be enough?

[1] https://forge.softwareheritage.org/source/swh-scheduler/browse/master/swh/scheduler/cli/origin.py$148-182

Jun 23 2021, 12:04 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
olasd added a comment to T3127: Compute and display distribution of origins by forge.

Sorry @anlambert, I was late at Monday's meeting and I completely missed this in your weekly plan, I would have pointed this out earlier.

Jun 23 2021, 12:04 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
ardumont added a comment to T3127: Compute and display distribution of origins by forge.

For information, while discussing with @olasd, he reminded me that we already had a CLI entrypoint [1]
to compute the stats we want on the scheduler side.

Jun 23 2021, 11:53 AM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert triaged T3402: Deploy swh-counters v0.8.0 and backfill origins as Normal priority.
Jun 23 2021, 11:13 AM · Counters, System administration, Metrics/monitoring

Jun 22 2021

anlambert added a revision to T3127: Compute and display distribution of origins by forge: D5910: journal_client: Add origins processing.
Jun 22 2021, 4:50 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert added a revision to T3127: Compute and display distribution of origins by forge: D5907: interface: Add get_listers method.
Jun 22 2021, 2:36 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert added a comment to T3127: Compute and display distribution of origins by forge.

Nice to see this moving forward!

These entries in the counter log look suspicious, though, they are not origins:

b'atlassian@bitbucket.org' 2
b'taylorhakes@github.com' 2
b'bunnyhero@bitbucket.org' 1
b'dtrebbien@bitbucket.org' 1
b'eldargab@github.com' 1
b'git@github.com' 1
b'schierlm@git.code.sf.net' 1
b'tomakehurst@github.com' 1
b'wenshao@github.com' 1
b'zimbra-mirror@bitbucket.org' 1
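
One plausible explanation for keys like these (an assumption, not something confirmed in this thread) is that some origin URLs carry a user-info part, as in `https://user@host/path`, and that the counting key was taken from the full network location. The sketch below uses Python's standard `urllib.parse` to show the difference between `netloc` and `hostname`. The sample URLs and the `count_by_forge` helper are hypothetical and are not the swh-counters code.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_forge(urls):
    """Count origin URLs per forge host, ignoring any user-info part."""
    counts = Counter()
    for url in urls:
        # .hostname drops "user@" and ":port"; .netloc would keep them.
        host = urlsplit(url).hostname
        if host:
            counts[host] += 1
    return counts


# Hypothetical sample origins, for illustration only.
sample = [
    "https://github.com/example/repo",
    "https://someuser@github.com/example/other",
    "https://someuser@bitbucket.org/example/repo",
]

print(urlsplit(sample[1]).netloc)  # netloc keeps the user-info part
print(count_by_forge(sample))
```

Keying on the hostname folds such URLs back into their forge instead of creating one counter per user name.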
Jun 22 2021, 2:05 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
rdicosmo added a comment to T3127: Compute and display distribution of origins by forge.

Nice to see this moving forward!

Jun 22 2021, 1:59 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert added a comment to T3127: Compute and display distribution of origins by forge.

Regarding this, to ease the mapping between a lister and an instance name, we may want to rework the instance names in the scheduler
model (listers table) so that the value is actually the netloc of the origin.

Jun 22 2021, 12:18 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
ardumont added a comment to T3127: Compute and display distribution of origins by forge.

Great work! Awesome.

Jun 22 2021, 12:16 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
anlambert added a comment to T3127: Compute and display distribution of origins by forge.

After some analysis, the data we need to properly implement this are:

  • the set of lister names and their instance names in order to organize origins by forge types (gitlab, cgit, sourceforge, ...)
  • a precise or estimated count for the origins listed by a given lister instance
Jun 22 2021, 12:07 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task

May 28 2021

ardumont added a comment to T1481: add metric to monitor "save code now" efficiency.

Now what's missing here (not sure how hard it is) is the mean and max ingestion time
of save code now requests (the time between a request being accepted and its loader task
finishing).

May 28 2021, 11:54 AM · Save Code Now, System administration, Metrics/monitoring

Apr 23 2021

vlorentz assigned T1363: Have metrics in prometheus for each tracked forge to olasd.
Apr 23 2021, 4:52 PM · Roadmap 2021, Metrics/monitoring, System administration
vlorentz assigned T3127: Compute and display distribution of origins by forge to anlambert.
Apr 23 2021, 4:52 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task

Apr 20 2021

ardumont added a project to T1481: add metric to monitor "save code now" efficiency: Save Code Now.
Apr 20 2021, 4:42 PM · Save Code Now, System administration, Metrics/monitoring
ardumont added a comment to T1481: add metric to monitor "save code now" efficiency.

Note that there is the same transient vs cumulative discrepancy on the "Accepted requests" graph.

Apr 20 2021, 4:35 PM · Save Code Now, System administration, Metrics/monitoring
ardumont added a comment to T1481: add metric to monitor "save code now" efficiency.

I think the "submitted requests per visit type / status" graph should be split into 2 parts. Both accepted and rejected are cumulative values that will grow indefinitely, while pending is a transient value that should stay near zero, so it makes no sense to have them on the same graph.

Since there is already a graph dedicated to pending requests, pending requests should just be removed from the submitted requests graph.
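
To illustrate why the two kinds of series sit badly on one graph, here is a minimal sketch with made-up sample values. A cumulative series only becomes comparable to a transient one once it is converted into per-interval increases. The `per_interval_increase` helper and all the numbers are hypothetical and are not taken from the dashboard.

```python
def per_interval_increase(cumulative):
    """Turn an ever-growing counter series into per-interval increases."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


# Made-up samples taken at regular intervals.
accepted_total = [100, 104, 109, 109, 115]  # cumulative: only ever grows
pending_now = [3, 1, 0, 2, 0]               # transient: should stay near zero

print(per_interval_increase(accepted_total))
print(max(accepted_total), max(pending_now))
```

On a shared axis the cumulative total quickly dwarfs the transient values, which is the scale problem described above.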

Apr 20 2021, 4:26 PM · Save Code Now, System administration, Metrics/monitoring
douardda added a comment to T1481: add metric to monitor "save code now" efficiency.

Note that there is the same transient vs cumulative discrepancy on the "Accepted requests" graph.

Apr 20 2021, 11:06 AM · Save Code Now, System administration, Metrics/monitoring
douardda added a comment to T1481: add metric to monitor "save code now" efficiency.

I think the "submitted requests per visit type / status" graph should be split into 2 parts. Both accepted and rejected are cumulative values that will grow indefinitely, while pending is a transient value that should stay near zero, so it makes no sense to have them on the same graph.

Apr 20 2021, 11:02 AM · Save Code Now, System administration, Metrics/monitoring
douardda added a comment to T1481: add metric to monitor "save code now" efficiency.

I think the "submitted requests per visit type / status" graph should be split into 2 parts. Both accepted and rejected are cumulative values that will grow indefinitely, while pending is a transient value that should stay near zero, so it makes no sense to have them on the same graph.

Apr 20 2021, 11:00 AM · Save Code Now, System administration, Metrics/monitoring

Apr 12 2021

ardumont moved T1481: add metric to monitor "save code now" efficiency from Backlog to deployed/landed/monitoring on the System administration board.
Apr 12 2021, 3:56 PM · Save Code Now, System administration, Metrics/monitoring
ardumont added a project to T1481: add metric to monitor "save code now" efficiency: System administration.
Apr 12 2021, 3:56 PM · Save Code Now, System administration, Metrics/monitoring

Apr 9 2021

ardumont claimed T1481: add metric to monitor "save code now" efficiency.
Apr 9 2021, 4:14 PM · Save Code Now, System administration, Metrics/monitoring
ardumont added a comment to T1481: add metric to monitor "save code now" efficiency.

I've tentatively updated the save code now dashboard [1]
with that ^ new metric deployed in staging and production instances.

Apr 9 2021, 4:01 PM · Save Code Now, System administration, Metrics/monitoring

Apr 8 2021

ardumont added a revision to T1481: add metric to monitor "save code now" efficiency: D5463: Add metric to monitor "save code now" efficiency.
Apr 8 2021, 4:33 PM · Save Code Now, System administration, Metrics/monitoring

Apr 7 2021

ardumont added a comment to T1481: add metric to monitor "save code now" efficiency.

As a heads up, we can already determine some basic metrics out of the postgres db.

Apr 7 2021, 4:03 PM · Save Code Now, System administration, Metrics/monitoring
ardumont added a comment to T1481: add metric to monitor "save code now" efficiency.

process a "save code now" request (including "take snapshot now")

Apr 7 2021, 3:18 PM · Save Code Now, System administration, Metrics/monitoring
ardumont added a comment to T1481: add metric to monitor "save code now" efficiency.

The archive computes its own prometheus metrics regarding save code now [1].
Also, the save code now model exposes a request_date and a visit_date [2].
So a first approximation on this would be to use those 2 fields and expose a new adapted metric.
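
As a rough sketch of that first approximation, the snippet below derives a per-request duration from the two fields and summarizes it. Only the field names `request_date` and `visit_date` come from the comment above. The row layout, the sample values, and the `ingestion_durations` helper are assumptions made for illustration, not the actual model or metric code.

```python
from datetime import datetime
from statistics import mean


def ingestion_durations(requests):
    """Return seconds between request_date and visit_date for visited requests."""
    return [
        (r["visit_date"] - r["request_date"]).total_seconds()
        for r in requests
        if r.get("visit_date") is not None
    ]


# Hypothetical rows; only the two field names come from the comment above.
rows = [
    {"request_date": datetime(2021, 4, 7, 12, 0), "visit_date": datetime(2021, 4, 7, 12, 5)},
    {"request_date": datetime(2021, 4, 7, 12, 10), "visit_date": datetime(2021, 4, 7, 12, 25)},
    {"request_date": datetime(2021, 4, 7, 12, 30), "visit_date": None},  # not visited yet
]

durations = ingestion_durations(rows)
print(mean(durations), max(durations))
```

Requests without a `visit_date` are skipped here; whether they should instead count toward a pending metric is a separate design choice.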

Apr 7 2021, 12:54 PM · Save Code Now, System administration, Metrics/monitoring

Mar 15 2021

vlorentz triaged T3127: Compute and display distribution of origins by forge as Normal priority.
Mar 15 2021, 12:28 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task

Mar 14 2021

rdicosmo updated subscribers of T3127: Compute and display distribution of origins by forge.
Mar 14 2021, 8:00 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task
rdicosmo edited projects for T1363: Have metrics in prometheus for each tracked forge, added: Roadmap 2021; removed Restricted Project.
Mar 14 2021, 7:58 PM · Roadmap 2021, Metrics/monitoring, System administration
rdicosmo created T3127: Compute and display distribution of origins by forge.
Mar 14 2021, 7:56 PM · Metrics/monitoring, Web app, Roadmap 2021, meta-task

Mar 4 2021

rdicosmo added a parent task for T1481: add metric to monitor "save code now" efficiency: T3082: Improve Save Code Now handling.
Mar 4 2021, 10:36 AM · Save Code Now, System administration, Metrics/monitoring

Feb 10 2021

ardumont moved T2787: Improve access_logs parsing from in-progress to done on the System administration board.
Feb 10 2021, 7:06 PM · System administration, Metrics/monitoring

Feb 5 2021

vsellier closed T2787: Improve access_logs parsing as Resolved.

It seems there were some huge queries over the last few days [1], so the script needed to be adapted to use Long instead of Integer:

apache_logs-2021.01.14:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "script_stack" : [
          "java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)",
          "java.base/java.lang.Integer.parseInt(Integer.java:652)",
          "java.base/java.lang.Integer.parseInt(Integer.java:770)",
          "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ",
          "                                                                                                ^---- HERE"
        ],
        "script" : "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Integer.parseInt(ctx._source.response) : ctx._source.response;",
        "lang" : "painless",
        "position" : {
          "offset" : 96,
          "start" : 0,
          "end" : 125
        }
      }
    ],
    "type" : "script_exception",
    "reason" : "runtime error",
    "script_stack" : [
      "java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:68)",
      "java.base/java.lang.Integer.parseInt(Integer.java:652)",
      "java.base/java.lang.Integer.parseInt(Integer.java:770)",
      "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ",
      "                                                                                                ^---- HERE"
    ],
    "script" : "ctx._source.bytes = ctx._source.bytes instanceof java.lang.String ? Integer.parseInt(ctx._source.bytes) : ctx._source.bytes; ctx._source.response = ctx._source.response instanceof java.lang.String ? Integer.parseInt(ctx._source.response) : ctx._source.response;",
    "lang" : "painless",
    "position" : {
      "offset" : 96,
      "start" : 0,
      "end" : 125
    },
    "caused_by" : {
      "type" : "number_format_exception",
      "reason" : "For input string: \"4633815064\""
    }
  },
  "status" : 400
}
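
The arithmetic behind this failure can be checked quickly. The sketch below only compares the value from the error message against the signed 32-bit and 64-bit limits that Java's `Integer` and `Long` types use; it does not reproduce the migration script itself.

```python
INT32_MAX = 2**31 - 1   # largest value Java's Integer.parseInt accepts
INT64_MAX = 2**63 - 1   # largest value Java's Long.parseLong accepts

value = int("4633815064")  # the input string from the error above

print(value > INT32_MAX)   # too large for a 32-bit signed integer
print(value <= INT64_MAX)  # fits in a 64-bit signed integer
```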
Feb 5 2021, 9:09 AM · System administration, Metrics/monitoring

Feb 4 2021

vsellier added a comment to T2787: Improve access_logs parsing.

The open Apache indexes are currently being migrated with the script from P940.

Feb 4 2021, 8:12 PM · System administration, Metrics/monitoring
vsellier added a comment to T2787: Improve access_logs parsing.

The log parsing is ok.
An Elasticsearch datasource was created in Grafana, so we can now create graphs based on the logs stored in Elasticsearch.
A simple dashboard displaying some statistics based on the Apache logs was started [1]. The design is not as simple as in Kibana and has some limitations, but it still allows basic information to be centralized in Grafana.

Feb 4 2021, 10:42 AM · System administration, Metrics/monitoring

Feb 2 2021

vsellier added a revision to T2787: Improve access_logs parsing: D5000: deposit: add request duration on access logs.
Feb 2 2021, 7:05 PM · System administration, Metrics/monitoring
vsellier added a comment to T2787: Improve access_logs parsing.

Configuration deployed for the webapp on all servers; the logs now include the duration, which is parsed in the Elasticsearch entries:

Feb 2 2021, 3:39 PM · System administration, Metrics/monitoring
vsellier added a revision to T2787: Improve access_logs parsing: D4989: Add request durations in access logs and improve logstash's integer parsing.
Feb 2 2021, 9:55 AM · System administration, Metrics/monitoring

Jan 29 2021

vsellier added a revision to T2787: Improve access_logs parsing: D4974: logstash: fix first puppet run and configuration updates.
Jan 29 2021, 5:05 PM · System administration, Metrics/monitoring
vsellier changed the status of T2787: Improve access_logs parsing from Open to Work in Progress.
Jan 29 2021, 2:34 PM · System administration, Metrics/monitoring
vsellier added a project to T2787: Improve access_logs parsing: System administration.
Jan 29 2021, 2:33 PM · System administration, Metrics/monitoring

Nov 17 2020

vsellier added a comment to T2733: Explore / install a varnish prometheus probe.

The Varnish logs should also be ingested into Elasticsearch to get fine-grained statistics.

Nov 17 2020, 2:42 PM · Metrics/monitoring, System administration
vsellier triaged T2787: Improve access_logs parsing as Normal priority.
Nov 17 2020, 12:36 PM · System administration, Metrics/monitoring
vsellier added a project to T2733: Explore / install a varnish prometheus probe: Metrics/monitoring.
Nov 17 2020, 11:54 AM · Metrics/monitoring, System administration

Nov 3 2020

ardumont moved T1490: Use origin url on external-id attribute on deposit admin page from Backlog to Archived on the SWORD deposit board.
Nov 3 2020, 4:07 PM · Metrics/monitoring, SWORD deposit

Oct 26 2020

douardda closed T1370: Report key code metrics in prometheus as Resolved.
Oct 26 2020, 12:30 PM · Metrics/monitoring, Restricted Project, Continuous Integration, System administration

Oct 16 2020

ardumont added a comment to T2087: Create script to test SWORD deposit on SWH.

This can be closed now.

Oct 16 2020, 11:57 AM · Metrics/monitoring, SWORD deposit

Sep 22 2020

olasd added a comment to T1461: Add loader-related metrics to swh-loader-core.

I think the second point mostly happened: the storage is returning statistics to the loader, but the loaders don't generally collect them.

Sep 22 2020, 6:13 PM · Core Loader, Metrics/monitoring
olasd updated the task description for T1461: Add loader-related metrics to swh-loader-core.
Sep 22 2020, 6:11 PM · Core Loader, Metrics/monitoring
olasd updated the task description for T1461: Add loader-related metrics to swh-loader-core.
Sep 22 2020, 6:10 PM · Core Loader, Metrics/monitoring
olasd placed T1461: Add loader-related metrics to swh-loader-core up for grabs.
Sep 22 2020, 6:10 PM · Core Loader, Metrics/monitoring
olasd closed T1435: Improve swh-scheduler prometheus metrics, a subtask of T1408: More/better Metrics, as Resolved.
Sep 22 2020, 6:09 PM · Metrics/monitoring, Sprint 2018 12
olasd closed T1435: Improve swh-scheduler prometheus metrics as Resolved.
Sep 22 2020, 6:09 PM · Metrics/monitoring, Sprint 2018 12
olasd closed T1438: Add labels to prometheus metrics to help queries as Resolved.

We've definitely improved on this (notably using proper hostnames for the instance label on prom metrics). I think we should make this task more actionable if we want to keep it open.

Sep 22 2020, 6:08 PM · Metrics/monitoring, Sprint 2018 12
olasd closed T1438: Add labels to prometheus metrics to help queries, a subtask of T1408: More/better Metrics, as Resolved.
Sep 22 2020, 6:08 PM · Metrics/monitoring, Sprint 2018 12

Apr 21 2020

olasd closed T1270: Investigate an application monitoring tool to automate error detection in our workers as Resolved.

I'm pretty sure this is done now ;p

Apr 21 2020, 11:36 AM · Metrics/monitoring, Development environment

Feb 15 2020

vlorentz moved T2175: Deploy swh-icinga-plugins from Backlog to deployed on the Sprint 2019/12 (Monitor and Conquer) board.
Feb 15 2020, 8:18 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
vlorentz moved T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services from Backlog to deployed on the Sprint 2019/12 (Monitor and Conquer) board.
Feb 15 2020, 8:18 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration

Jan 27 2020

vlorentz added a comment to T1365: Archive coverage metrics in prometheus.

https://grafana.softwareheritage.org/d/3SAW_JEmk/software-heritage-archive-counters

Jan 27 2020, 4:44 PM · Metrics/monitoring, Restricted Project
vlorentz closed T1365: Archive coverage metrics in prometheus, a subtask of T1364: Have production metrics in prometheus or kibana, as Resolved.
Jan 27 2020, 4:44 PM · Metrics/monitoring, Restricted Project
vlorentz closed T1365: Archive coverage metrics in prometheus as Resolved.
Jan 27 2020, 4:44 PM · Metrics/monitoring, Restricted Project

Jan 23 2020

ardumont closed T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services as Resolved.

Deployed.

Jan 23 2020, 12:09 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
ardumont added a parent task for T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services: T2238: Configure Sentry environments.
Jan 23 2020, 11:13 AM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration

Jan 22 2020

ardumont added a revision to T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services: D2576: sentry: Define setup for swh services (servers, workers, ...).
Jan 22 2020, 6:50 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
vlorentz added a project to T2228: Metrics and monitoring: Metrics/monitoring.
Jan 22 2020, 4:27 PM · Metrics/monitoring, Roadmap 2020
ardumont claimed T2181: Set SWH_MAIN_PACKAGE and SWH_SENTRY_ENVIRONMENT for all services.

Adapting the puppet manifest so we can discriminate issues per environment in sentry.

Jan 22 2020, 4:13 PM · Metrics/monitoring, Sprint 2019/12 (Monitor and Conquer), System administration
ardumont closed T2175: Deploy swh-icinga-plugins, a subtask of T1011: Enable continuous monitoring of deposit, as Resolved.
Jan 22 2020, 3:29 PM · Metrics/monitoring, SWORD deposit
ardumont closed T2175: Deploy swh-icinga-plugins as Resolved.
Jan 22 2020, 3:29 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
ardumont added a comment to T2175: Deploy swh-icinga-plugins.

Vault check deployed!

Jan 22 2020, 3:28 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
ardumont added a comment to T2175: Deploy swh-icinga-plugins.

Deposit check deployed!

Jan 22 2020, 2:12 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring
ardumont added a comment to T2175: Deploy swh-icinga-plugins.

debian package this

Jan 22 2020, 2:12 PM · Sprint 2019/12 (Monitor and Conquer), System administration, Metrics/monitoring